Building a Search Engine for Data

I don’t know if you’ve heard, but last month we built a search engine that is currently indexing open data sites around the world.

In the current startup ecosystem, that seems so retro. We’ve all basically accepted Google as our lord and master, the tastemaker to end all tastemakers. Why bother investing time and energy when you will never be able to compete with such a dominant player?

But as anyone who has dipped their toes in the waters of SEO will tell you, Google’s algorithms judge quality by making a bunch of core assumptions about what useful internet content is supposed to look like. These assumptions overemphasize pages with lots of high-quality text (blogs) and underemphasize pages with duplicate structure and low amounts of text (like … for example, catalogues).

That means using Google to try to figure out which open data site has the data you need is practically impossible. Suppose you are a reporter living in Brooklyn trying to find data on animal sacrifice (hey, it happens). Such data undoubtedly exists, either through 311 calls or police reports, but the question is where to look for it. You could search at the national level through Data.gov, at the state level through Data.ny.gov, at the city level through NYC Open Data, or you could search any number of repositories run by informal local initiatives such as BetaNYC’s data repository.

But that assumes that you know that any of those sites exist in the first place.

And here’s what would happen if you ran those searches:

– Data.gov’s search returns no results for “animal sacrifice”

– Data.ny.gov returns data on the number of horses injured or killed at racetracks

– NYC Open Data returns the Brooklyn Public Library Catalogue (wtf?) and a mysteriously named “Multiagency Permits” dataset.

– BetaNYC returns no results for “animal sacrifice”

Exversion’s search engine returns reports from animal services, animal shelters, animal care enforcement, etc. And the best part is that most of this data links back to Data.gov … the first site we searched, the one that told us it had no data matching our query.

Building a Search Engine is Hard

Building search engines isn’t just passé, it’s freaking hard. And it gets harder the more the web grows. The processing power required to crawl, essentially downloading and parsing billions of documents, is pretty intense. Then storing that data, indexing it, and running queries over those billions and billions of documents requires more resources than most startups can manage, even with valuations swelling the way they are.

It’s not easy. It’s no longer the low-hanging fruit it was in the nineties, when the web was smaller and fewer devices were generating content.

Fortunately for us, we only need to crawl a very specific subset of the web, which means we can slash the resources needed to complete this task by making a few specific assumptions:

  • The information we want is hosted on specific sites
  • These sites have a common API and structure
  • New information is not added to these sites on a daily basis and old information is rarely if ever updated

Anyone with a computer can create content Google might want to index. Not everyone can run an open data portal. Likewise, the number of companies that provide ways of publishing content on the internet is enormous and still growing, while the number of companies providing ways to publish data is only a handful. And of that handful, two major players dominate the market: Socrata and CKAN.

All Socrata instances come with a sitemap. It’s not obvious where it lives, but the portal’s robots.txt will give you the link. From there we just carefully follow each link and scrape the title and description from each dataset listing. It works well because this is exactly what sitemaps are designed for.
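To make that concrete, here’s a rough sketch of the discovery step, assuming a generic Socrata portal (the domain below is just an example). It does the robots.txt-to-sitemap hop and lists the URLs a crawler would then fetch and scrape for titles and descriptions:

```python
# Sketch: find a Socrata portal's sitemap via robots.txt, then list the
# URLs it points at. The portal URL is an example; any Socrata instance
# exposes the same pattern.
import re
import requests
from xml.etree import ElementTree

PORTAL = "https://data.cityofnewyork.us"  # example Socrata-backed portal

def sitemap_urls(portal):
    robots = requests.get(portal + "/robots.txt", timeout=30).text
    # robots.txt advertises one or more "Sitemap:" lines
    return re.findall(r"(?im)^sitemap:\s*(\S+)", robots)

def listed_urls(sitemap_url):
    xml = requests.get(sitemap_url, timeout=30).content
    tree = ElementTree.fromstring(xml)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    # A sitemap may be a plain URL list or an index pointing at more
    # sitemaps; both use <loc> elements in the same namespace.
    for loc in tree.findall(".//sm:loc", ns):
        yield loc.text.strip()

if __name__ == "__main__":
    for sm in sitemap_urls(PORTAL):
        # The actual crawl would fetch each page and pull out the dataset's
        # title and description; here we just print the first few URLs.
        for url in list(listed_urls(sm))[:10]:
            print(url)
```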

CKAN makes it even easier because CKAN has a pretty decent API. One of the endpoints lets you search through the data available on that instance. It requires no authentication, and if you don’t provide a query it will return … well, everything.

Even better, package_search returns all the metadata. So with pagination we can scrape hundreds of thousands of datasets in minutes.
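A minimal sketch of that pagination loop, using data.gov’s catalog as the example CKAN instance (any CKAN portal exposes the same package_search action):

```python
# Sketch: page through CKAN's package_search endpoint and pull metadata
# for every dataset on the instance.
import requests

BASE = "https://catalog.data.gov/api/3/action/package_search"

def iter_datasets(rows=100):
    start = 0
    while True:
        resp = requests.get(BASE, params={"rows": rows, "start": start}, timeout=30)
        result = resp.json()["result"]
        for pkg in result["results"]:
            # Each package carries full metadata: title, notes, tags, resources...
            yield pkg["title"], pkg.get("notes", "")
        start += rows
        if start >= result["count"]:
            break

if __name__ == "__main__":
    for i, (title, notes) in enumerate(iter_datasets()):
        print(title)
        if i >= 9:  # just demonstrate the first few results
            break
```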

Outliers, Special Snowflakes, and Scrapers as a Service

Of course, not EVERY site uses either Socrata or CKAN. Most of the world’s open scientific data is on proprietary platforms, and most of the world’s open geodata is on GeoNode instances. Before we could consider the challenge complete, we had to figure out a way to handle sites that don’t share a standard structure.

I’m awfully fond of scraping-as-a-service because I’m really not fond of writing individual scrapers every time I need to grab data from a site. But as useful as they are, these companies always seem to have a hard life. ScraperWiki decided it made more money from consulting and closed down its interface for individual devs. Import.io raised venture funding and inexplicably stopped working. 3Taps curates what it will scrape. 80legs tends to accidentally DDoS sites.

About a year ago someone showed me Kimono, and a few really interesting features stood out. While Import.io only lets you control scrapers via API, Kimono lets you actually access your data via API. You can schedule scrapes to run regularly. And best of all, Kimono has webhooks.

That means we can set up Kimono to scrape our outliers, and when a scrape completes, it sends the data directly to our servers via webhook.
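On our end, the receiving side is little more than an endpoint that accepts a JSON POST. A minimal sketch, assuming Flask and a hypothetical /webhooks/kimono route (the payload field names are illustrative, not Kimono’s exact schema):

```python
# Sketch: a webhook receiver for scrape results. The route and the
# "results" field are hypothetical; the scraper simply POSTs its output
# as JSON to whatever URL you configure.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/kimono", methods=["POST"])
def receive_scrape():
    payload = request.get_json(force=True)
    # In production this would be validated and queued for indexing;
    # here we just report what arrived.
    print("received", len(payload.get("results", [])), "records")
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```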

Sweet!

Indexing All The Data

The technology is stable, so right now we’re just adding as many sites to our crawl list as possible. We want to prioritize the sites being crawled to make sure that the ones you are most interested in searching get indexed first. You can check out what we’ve indexed so far and what’s on the queue right here. Feel free to add your favorite data site if you don’t see it there.

(photo by pleuntje)