Big Crisis Data

For the last three years humanitarian data has been a huge part of our consulting work, our bread and butter so to speak. It’s something we’re deeply passionate about, which is why we got super excited when we saw a new book specifically on data science in crisis situations was being released this July.

We’ve had so much fun reading How Not to Network a Nation with all of you that we decided to take advantage of this opportunity to continue the book club with Big Crisis Data. (Our review of How Not to Network will be online next week.)

This promises to be a wholly different reading experience, less narrative, more detailed technical approaches. Not entirely sure if it will work as a group reading experience, but excited to see what the response is!

Like before, we’re kicking off this selection by giving away a free copy … which you’ll want to try to take advantage of because Big Crisis Data does have a hefty sticker price!

Data Beach Reads

For the longest time my summer reading was an academic work titled something like “Global Capital Markets of the Czech Republic.” It was about 8 inches by 11 inches, a gazillion pages with a teal hard cover. It was filled with amusing and bizarre stories about the policy blunders that took the Czech Republic from Communist to Capitalist. It was wonderful.

There’s still sand from the beach stuck between its pages.

There’s a point just after high school where the concept of summer reading goes from stuff that’s designed to educate you, to stuff that’s designed to be rather mindless.

So in the spirit of being oddballs we thought it might be fun to invite the internet to our book club:)

We’ll be reading Benjamin Peters’s How Not to Network a Nation. And … oh … we’re going to give away a copy of the book too! Just in case you were wondering 😀 Follow the link below to enter!

Win a copy of How Not To Network a Nation

We’ll also be posting regular status updates as we read on GoodReads. Come follow along!

How to Accelerate Smartly

Accelerating in business is like accelerating in traffic: you could get where you want to go faster, but you’re much more likely to slam into a wall and ruin your chances of getting there at all. Last month I gave a talk at StartupBus finals about making decisions that were wrong for our actual business but were nevertheless attractive because we assumed they would make us seem like a more legitimate startup. We wasted a lot of time chasing validation at Exversion’s expense and it finally got to the point where I started asking myself who really benefitted from pushing this idea that all startups need to take the same path into the world of venture capital. There are so many ways to be successful, I cannot overstate how important it is to pick an approach that fits your business and your industry.

Accelerators and incubators are generally thought of as the first step on the road to startup legitimacy. So much so that founders seem willing to give up a slice of their equity to push their business into a program without any real sense of what they expect the program to do for them. Joining an accelerator is the startup equivalent of going to college: you do it because you’re supposed to and because you’re raised to think that it will improve your odds going forward.

Unlike college, there’s very little evidence that it will.

Still, there are so many different kinds of programs now there’s bound to be one that fits your business. Now the two programs we got the most out of are accepting new applications. And guess what? Neither one of them takes equity. I have often joked that we are the masters of accelerators that give you no money. If you think the money is the most important part of joining an accelerator you won’t be in business much longer. I’m here to convince you that you don’t need the money, and to be frank you don’t even want the money either.

Friends of Ebay

Internally Friends of Ebay is always abbreviated as FoB, even though obviously the correct acronym is FoE. Fitting because your relationship with Ebay is kind of awkwardly undefined. Essentially FoB is just Ebay subletting their unused office space to startups that interest them. It’s a great program in that you get free, fully equipped office space including printers, scanners, conference rooms, a kitchen and private bathrooms. Lunch is free on Tuesdays, you can book a free massage on Mondays and you can usually secure the event space for your own events free of charge. Every month there would be a little ice cream party for people who had birthdays that month. If a speaker drops by to address Ebay’s workers, you get to be there too.

However there’s also a careful and strict separation between you and Ebay, perhaps for everyone’s benefit. Your key card will only give you access to certain parts of the offices at certain times, which means you can’t raid the snacks in Ebay’s kitchen or blow off some steam with a game of Street Fighter on their arcade machine. The isolation makes it hard to mingle with Ebay staff and find that mentorship they promise but you are invited to their Christmas party.

Instead you bond pretty tightly with the other startups because you are a little colony all to yourself. It ends up being a tiny WeWork that you get completely 100% for free.

First Growth Venture Network

We used to jokingly refer to FGVN as group therapy for startups. Every month we would all gather and talk about our problems, ask for advice and just generally commiserate with our colleagues. FGVN was exactly the right thing for us, at exactly the right time. We would come in feeling beaten down and lost but leave with a renewed sense of determination and purpose.

Plus, you know, they always fed us well.

FGVN days are full days. The first couple of hours are private discussion as a founder group, then you move to the event space where an elite mix of alumni, entrepreneurs, venture capitalists and tech journalists have gathered for a panel discussion from an even more elite group of entrepreneurs, VCs and executives. I mean, where else can you eat strawberry shortcake with Alan Patricof while listening to David Karp talk about the early days of Tumblr? Or the time we had a breakout session with David Draiman from Disturbed and I kept catching myself humming “Down with the Sickness” while he talked about his new app? Or the time Jacek turned around to hold the door for someone and it turned out to be one of the founding investors in Spotify?

Surreal things happen at First Growth.

But beyond the extraordinary experiences, the ordinary experiences of FGVN are pretty awesome as well: the networking, the free consultations from fundraising experts on everything from pitching to deal structure. Actually I think the most valuable thing about FGVN is the ability to just talk openly and honestly about what’s really going on. You don’t realize how valuable the opportunity to have feedback from someone who has no agenda is until you’re without it.

FGVN is also just a great opportunity to get to know the startup team at Lowenstein Sandler. The program is free because they want your business, but to be honest, they are among the best and most well connected startup lawyers around. By the end of the program we wanted to give them our business too!

A necessary trade-off here is that because Lowenstein makes their money providing legal services for VC deals, the content of FGVN meetings inevitably seems to be about fundraising. Even when the formal topic is something else, discussions ultimately work their way back to fundraising. The audience is heavily skewed toward VCs and every panel has at least one VC on it who seems to take every question and make it somehow about their investing strategy. Josh Kopelman is invited to speak A LOT and sometimes it seems there are more inside jokes and banter between him and Ed Zimmerman than actual advice for entrepreneurs.

The takeaway here is that bootstrapped startups will not get as much out of FGVN as startups actively looking for investors.

Apply Soon

Both programs are now accepting applications for their next class. These programs are open to startups at all stages (idea stage, pre-money, funded, etc) and take absolutely no equity. They were great experiences for us, so we encourage you to apply today!

Proposal: Building Scalable Data Infrastructure Without Geeks

Original image by Tom Carmony

Every year there’s a technical conference just for the nonprofit community run by the Nonprofit Technology Enterprise Network. This year we had so much fun talking data with so many great organizations we submitted a session idea to the community for consideration at the 2016 conference. We’re calling it: Building Scalable Data Infrastructure Without Geeks

The most important decisions about an organization’s data are often made before the organization has enough money to hire an expert. Most of the advice small, cash-strapped nonprofits get on how to manage their data is “buy this piece of software”, and yet it is possible to set up a scalable, developer/analyst-friendly infrastructure MacGyver style from tools the nontechnical staff already knows.

If you like this idea, we encourage you to vote for it. If it gets picked we’ll publish a companion blog post here with plenty of resources and advanced topics.

Guide to Data Science Competitions

Summer is finally here and so are the long-form virtual hackathons. Unlike a traditional hackathon, which focuses on what you can build in one place in one limited time span, virtual hackathons typically give you a month or more to work from wherever you like.

And for those of us who love data, we are not left behind. There are a number of data science competitions to choose from this summer. Whether it’s a new Kaggle challenge (which are posted year round) or the data science component of Challenge Post’s Summer Jam Series, there are plenty of opportunities to spend the summer either sharpening or showing off your skills.

The Landscape: Which Competitions are Which?

  • Kaggle
    Kaggle competitions have corporate sponsors that are looking for specific business questions answered with their sample data. In return, winners are rewarded handsomely, but you have to win first.
  • Summer Jam
    Challenge Post’s Summer Jam Open Data Mashup runs in June and focuses on mashing up multiple open data sets (use the Data Search Engine to find some great options). Competitors are not asked to answer a specific question, so this competition is well suited for beautiful experiments in visualizing data.
  • DrivenData
    Like Kaggle, DrivenData competitions have a sponsor with a specific research question and specific sample data. DrivenData sponsors, however, tend to be more social impact minded.

Over the months we’ve posted many great links on winning data science competitions through our mailing list, but if you’ve missed them here’s a list of the best resources, advice and tutorials:

Choosing Your Weapons
DATA SCIENCE WARS: R VS. PYTHON
http://101.datascience.community/2015/05/12/data-science-wars-r-vs-python/

3 Must-Ask Questions Before Choosing That Machine Learning Algorithm!
http://www.analyticbridge.com/profiles/blogs/wait-why-are-you-using-that-algorithm

Dictionary of Algorithms and Data Structures
http://xlinux.nist.gov/dads/

Fast Non-Standard Data Structures for Python
http://kmike.ru/python-data-structures/

A list of assorted tools and such mentioned and used During DSSG 2014
https://hackpad.com/A-list-of-assorted-tools-and-such-mentioned-and-used-During-DSSG-2014-wl5QgF3LsSU

Data Science Resources
https://github.com/jonathan-bower/DataScienceResources

12 Best Free Ebooks for Machine Learning
http://designimag.com/best-free-machine-learning-ebooks/

Top 10 data mining algorithms in plain English
http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/

Python Shortcuts
The Top Mistakes Developers Make When Using Python for Big Data Analytics
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics

11 Python Libraries You Might Not Know
http://blog.yhathq.com/posts/11-python-libraries-you-might-not-know.html

iPython Notebook Gallery (includes pandas cheat sheet)
http://nb.bianp.net/sort/views/

Visualizations
D3.js Step by Step
http://zeroviscosity.com/category/d3-js-step-by-step

For inspiration, check this index of visualization types for visualizing text
http://textvis.lnu.se/

Gestalt Principles for Data Visualization
http://emeeks.github.io/gestaltdataviz/section1.html

Advice From Past Competitors
Machine learning best practices we’ve learned from hundreds of competitions – Ben Hamner of Kaggle
https://www.youtube.com/watch?v=9Zag7uhjdYo

LESSONS LEARNED FROM THE HUNT FOR PROHIBITED CONTENT ON KAGGLE
http://mlwave.com/lessons-from-avito-prohibited-content-kaggle/

What I Learned From The Kaggle Criteo Data Science Odyssey
https://medium.com/@chris_bour/what-i-learned-from-the-kaggle-criteo-data-science-odyssey-b7d1ba980e6

6 Tricks I Learned From The OTTO Kaggle Challenge
https://medium.com/@chris_bour/6-tricks-i-learned-from-the-otto-kaggle-challenge-a9299378cd61

How to use R, H2O, and Domino for a Kaggle competition
http://blog.dominodatalab.com/using-r-h2o-and-domino-for-a-kaggle-competition/

Competing in a data science contest without reading the data
http://blog.mrtz.org/2015/03/09/competition.html

KAGGLE ENSEMBLING GUIDE
http://mlwave.com/kaggle-ensembling-guide/

Building a Search Engine for Data

I don’t know if you’ve heard, but last month we built a search engine that is currently indexing open data sites around the world.

In the current startup ecosystem, that seems so retro. We’ve all basically accepted Google as our lord and master, the tastemaker to end all tastemakers. Why bother investing time and energy when you will never be able to compete with such a dominant player?

But as anyone who has dipped their toes in the waters of SEO will tell you, Google’s algorithms judge quality by making a bunch of core assumptions about what useful internet content is supposed to look like. These assumptions overemphasize pages with lots of high quality text (blogs) and underemphasize pages with duplicate structure and low amounts of text (like … for example, catalogues).

That means using Google to try to figure out which open data site has the data you need is practically impossible. Suppose you are a reporter living in Brooklyn trying to find data on animal sacrifice (hey, it happens). Such data undoubtedly exists, either through 311 calls or police reports, but the question is where do you look for it? You could search at the national level through Data.gov, at the state level through Data.ny.gov, at the city level through NYC Open Data, or you could search any number of repositories run by informal local initiatives such as BetaNYC’s data repository.

But that assumes that you know that any of those sites exist in the first place.

And here’s what would happen if you ran those searches:

– Data.gov’s search returns no results for “animal sacrifice”

– Data.ny.gov returns data on the number of horses injured or killed at racetracks

– NYC Open Data returns the Brooklyn Public Library Catalogue (wtf?) and a mysteriously named “Multiagency Permits” dataset.

– BetaNYC returns no results for “animal sacrifice”

Exversion’s search engine returns reports from animal services, animal shelters, animal care enforcement, etc. And the best part is most of this data links back to Data.gov … the first site we searched that told us it had no data that fit our query.

Building a Search Engine is Hard

Building search engines isn’t just passé, it’s freaking hard, and it gets infinitely harder the more the web grows. The processing power required to crawl, which essentially means downloading and parsing billions of documents, is pretty intense. Then storing that data, indexing it and running queries over those billions and billions of documents requires more resources than most startups can manage, even with valuations swelling up like they are.

It’s not easy. It’s no longer the low hanging fruit that it was in the nineties when the web was smaller with fewer devices generating content.

Fortunately for us we only need to crawl a very specific subset of the web, which means we can slash the resources needed to complete this task by making a few specific assumptions:

  • The information we want is hosted on specific sites
  • These sites have a common API and structure
  • New information is not added to these sites on a daily basis and old information is rarely if ever updated

Anyone with a computer can create content Google might want to index. Not everyone can run an open data portal. Likewise the number of companies that provide ways of publishing content on the internet is infinite and increasing exponentially, while the number of companies providing ways to publish data is only a handful. And of that handful there are two major players who dominate the market: Socrata and CKAN.

All Socrata instances come with a sitemap. It’s not obvious where this is, but their robots.txt will give you the link. From there we just carefully follow each link and scrape the title and description from each listing of data. It works well because this is what sitemaps are designed to do.
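As a rough illustration of the robots.txt-to-sitemap step, here is a minimal Python sketch using only the standard library. It is not our production crawler: the function names and the example.org URLs are made up for illustration, and it assumes the portal follows the standard sitemaps.org format. Scraping the title and description from each listing page is left out because that depends on the page markup.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_robots(robots_txt):
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" survives.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def locs_from_sitemap(xml_text):
    """Return the <loc> entries of a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
            if loc.text]


def fetch(url, timeout=30):
    """Download a URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def listing_urls(base_url):
    """Yield every URL found in the sitemaps a portal advertises.

    If a sitemap is really an index, the yielded URLs are further
    sitemaps and need one more pass through locs_from_sitemap.
    """
    robots = fetch(base_url.rstrip("/") + "/robots.txt").decode("utf-8", "replace")
    for sitemap_url in sitemap_urls_from_robots(robots):
        for url in locs_from_sitemap(fetch(sitemap_url)):
            yield url
```

Calling `listing_urls("https://data.example.org")` (a placeholder address) would give you the list of pages to visit, politely and one at a time.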

CKAN makes it even easier because CKAN has a pretty decent API. One of the endpoints allows the user to search through the data available on that instance. It requires no authentication and if you don’t provide a query it will return …. well everything.

Even better, package search returns all the metadata. So with pagination we can scrape hundreds of thousands of datasets in minutes.
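For the curious, paging through a CKAN instance can look something like the sketch below. It uses CKAN’s `package_search` action with its `start` and `rows` parameters; everything else (the function names, the page size of 100, the choice to keep only name, title and notes) is an assumption made for the example rather than a description of our actual crawler.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, start, rows):
    """Request one page of package_search results with no query string."""
    query = urllib.parse.urlencode({"start": start, "rows": rows})
    url = f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_datasets(base_url, rows=100, fetch=fetch_page):
    """Yield a small metadata record for every dataset on a CKAN instance.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    start = 0
    while True:
        result = fetch(base_url, start, rows)["result"]
        batch = result["results"]
        if not batch:
            break
        for dataset in batch:
            yield {
                "name": dataset.get("name"),
                "title": dataset.get("title"),
                # CKAN stores the free-text description under "notes".
                "description": dataset.get("notes"),
            }
        start += len(batch)
        # "count" is the total number of matches reported by the server.
        if start >= result["count"]:
            break
```

Something like `for d in iter_datasets("https://ckan.example.org"): ...` (placeholder host) walks the whole catalogue one page at a time. Some instances cap `rows`, which is why the loop advances by the number of results actually returned rather than by the number requested.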

Outliers, Special Snowflakes, and Scrapers As Service

Of course not EVERY site uses either Socrata or CKAN. Most of the world’s open scientific data is on proprietary platforms, and most of the world’s open Geodata is on GeoNode instances. Before we could consider the challenge completed we had to figure out a way to handle sites that didn’t share a standard structure.

I’m awfully fond of Scraping As Service because I’m really not fond of writing individual scrapers every time I need to grab data from a site. But as useful as they are, these companies always seem to have a hard life. ScraperWiki decided they make more money from consulting and closed down their interface for individual devs. Import.io raised venture funding and inexplicably stopped working. 3Taps curates what they will scrape. 80legs tends to accidentally DDOS sites.

About a year ago someone showed me Kimono and there are a few really interesting features that stood out. While Import.io will allow you just to control scrapers via API, you can actually access your data via API with Kimono. You can schedule scrapes to occur regularly. And best of all, Kimono has webhooks.

Which means we could set up Kimono to scrape our outliers and when the scrape is complete, it sends the data directly to our servers via webhook.
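The receiving end of a webhook does not need to be fancy. Below is a generic sketch of a tiny endpoint that accepts a JSON POST and queues it for indexing. It assumes nothing about Kimono’s payload beyond it being JSON; the names `accept_payload`, `WebhookHandler`, `RECEIVED` and the port are all hypothetical, and a real deployment would want authentication and a proper queue.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a real indexing queue (hypothetical).
RECEIVED = []


def accept_payload(body, sink=RECEIVED):
    """Decode a JSON webhook body and queue it; return True on success."""
    try:
        record = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # Not valid UTF-8 or not valid JSON: reject it.
        return False
    sink.append(record)
    return True


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        ok = accept_payload(self.rfile.read(length))
        # 204: accepted with nothing to say back; 400: body was unusable.
        self.send_response(204 if ok else 400)
        self.end_headers()


def serve(port=8000):
    """Listen for webhook deliveries until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Running `serve()` and pointing the scraping service’s webhook URL at that port is the whole integration: the scraper pushes, and the server just listens.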

Sweet!

Indexing All The Data

The technology is stable, so right now we’re just adding as many sites to our crawl list as possible. We want to prioritize the sites being crawled to make sure that the ones you are most interested in searching get indexed first. You can check out what we’ve indexed so far and what’s on the queue right here. Feel free to add your favorite data site if you don’t see it there.

(photo by pleuntje)

A Tale of Two Conferences

This year we didn’t go to SXSW.

Instead Exversion went to NTEN’s Nonprofit Technology Conference in Austin, basically the week before the techies descended and bought out Sixth Street for their private parties.

It would have been easy to stick around in town for SXSW, several NTEN people did, but in the end I was glad we didn’t. I love SXSW, but this week I was flabbergasted by the quality of leads that came to us through the smaller, more specialized conference. Had we stayed, I would just now be following up on those connections and as a result the momentum might have been lost.

NTEN is basically paying for itself in clients and partners, which really surprised me. I’m more used to the benefits of conferences being more intangible.

Last year we had an extremely productive SXSW, filled with glimmers of that unique SXSW magic the organizers have practically trademarked. I ran into a guy I had been trying to set up a meeting with for three months on line at the Spotify House (he was behind me!) and we ended up having a critical meeting right there on the street. I grabbed beers with some Ushahidi devs and got great free advice on how to structure an open source consultancy. I met one of the cofounders of Infogram on the dance floor. I got frequently– and inexplicably– mistaken for Anna Kendrick (Was she even in town?).

But we didn’t bring in any new clients or new users. The opportunities that came from SXSW came months later, whereas the opportunities from NTEN started pouring in almost as soon as we landed back home.

And some people would look at that and say that large conferences like SXSW are not worth the trouble, but really I think the truth is these are two distinctly different types of conferences.

A lot of startup people think they’re going to SXSW to sell something– get new customers, get investors, launch their hot new whatever– but think about this for a minute: who comes to SXSW looking to buy things? Who looks at their business problem and thinks to themselves “I’m sure I can find a vendor with a solution to this at SXSW”?

No one. There may be some investors looking to scout startups, but actual deals are few and far between. If the crowd at SVB’s club house is any indication, most VC firms are sending associates rather than partners.

No, SXSW peddles in influence and novelty. People go hoping to build connections with influential people. And the influential people go looking to build connections with more influential people. People go to taste test the hot new thing, but only if the hot new thing is given to them for free. Nobody is going home with a new contract, a new client, or a big investment. Successful hustlers come home with hundreds of new contacts, maybe one or two of those will turn into something real.

Basically all developer conferences are like this. People come to learn, and to meet people, not to buy things.

NTEN, on the other hand, is a conference that people attend specifically to buy things. Thousands of organizations send representatives to find solutions to their technical problems. One such colleague told me that he had received specific instructions from his boss to come home with either a great product they could buy or a great consultant they could hire.

At the same time, a couple of years ago I attended another conference that peddled in influence. Thought leaders galore! One of the many people I met there was an entrepreneur running a small technical business in the developing world. Nobody was paying him much attention, he wasn’t anyone’s prestige catch.

Today things are completely different: he’s a TED fellow, a VC, and was named to one of those fancy “30 under 30” lists. When Exversion was working on Ebola data for the UN this summer, we were able to collaborate. He turned out to be one of those contacts that paid for the conference, but it took years for that connection to bring returns.

The moral of the story is it’s really critical to research conferences before you buy a badge. SXSW and events of that nature can offer fantastic long-term benefits, but if you need immediate results you’ll probably leave feeling like you’ve wasted your money. It’s not difficult to figure out whether a conference will be a buyers’ conference or an influence builder: look at the speakers, the panel topics, the branding. Ask yourself: who is going to buy a ticket to attend this and what will they hope to get out of it?

And the types of conferences you’re attending really should be strategic. Many people see SXSW as a conference to “launch” … I could not disagree more. SXSW is a valuable conference a year or more before you launch. It will connect you with journalists and hustlers whose networks and resources could be game changing. But in order to get access to those advantages you need to develop the relationship in a natural way, over time.

(image credit: Anthony Quintano)