While the wealth of applications built on top of open data is ever increasing, a fundamental problem persists in navigating open data websites. To put it bluntly, the entire process is a sad affair.
For example, take a complete listing of every farmer’s market in the United States, which includes the types of produce you can typically find at each one, or demographics on the Marital Status of Active Military Personnel. The variety is simply astounding, and while these are only two examples, there is a vast amount of data out there. However, getting to those bits of good data can be quite a pain.
People who love data see how gleaning insight from even static datasets can help us hack everything from public policy to our daily commute. Unfortunately, for developers to build these tools it’s not enough for data to be open and public; it also has to be accessible, and frequently it isn’t.
Common Problems with Public Data
No APIs – When data is released, it’s often just dumped on the internet as Excel spreadsheets. That means a developer has to import the data before building any kind of app. On top of maintaining the application’s source code, the developer also has to manually download and import each new dataset as it is released.
Datasets that are too large – Small datasets are workable, but some of the most interesting ones run to thousands of rows’ worth of information. While modern databases can easily handle millions of rows, past a certain point importing the data is no longer a simple point-and-click operation.
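Until platforms fix this, developers end up writing import scripts by hand. Here’s a minimal sketch of that workaround in Python, streaming rows into SQLite in batches so the whole file never has to fit in memory (the table name, sample data, and batch size are invented for illustration):

```python
import csv
import io
import sqlite3

def import_in_batches(csv_file, conn, batch_size=10000):
    """Stream rows from an open CSV file into SQLite in fixed-size batches."""
    reader = csv.reader(csv_file)
    header = next(reader)
    cols = ", ".join('"%s"' % h for h in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE IF NOT EXISTS data (%s)" % cols)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO data VALUES (%s)" % placeholders, batch)
            batch = []
    if batch:  # flush whatever is left over
        conn.executemany("INSERT INTO data VALUES (%s)" % placeholders, batch)
    conn.commit()

# Toy run with an in-memory file and database
sample = io.StringIO("city,population\nNew York,8405837\nAlbany,98424\n")
conn = sqlite3.connect(":memory:")
import_in_batches(sample, conn, batch_size=1)
print(conn.execute("SELECT COUNT(*) FROM data").fetchone()[0])  # 2
```

The point of the batching is that memory use stays flat no matter how many rows the file has — which is exactly the property a point-and-click import tool tends to lack.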
Data is in the wrong format – CSV stands for COMMA SEPARATED values. Not tab separated values, not pipe separated, not comma separated after a couple of paragraphs of random attribution text, not a giant blob of plain text. Many people just don’t seem to get this. When data isn’t in a standard format it becomes that much harder to work with.
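For what it’s worth, Python’s standard library can at least guess at a delimiter when a “CSV” turns out not to use commas. A small sketch (the sample rows are made up):

```python
import csv
import io

def parse_delimited(text):
    """Sniff the delimiter of a 'CSV' that may actually be tab- or pipe-separated."""
    dialect = csv.Sniffer().sniff(text, delimiters=",\t|")
    return list(csv.reader(io.StringIO(text), dialect))

comma = "name,state\nUnion Square Greenmarket,NY\n"
piped = "name|state\nUnion Square Greenmarket|NY\n"

# Both files parse to the same rows despite the different separators
print(parse_delimited(comma) == parse_delimited(piped))  # True
```

Sniffing only rescues the mildly malformed cases, though; no amount of dialect detection fixes a file with paragraphs of attribution text stuck above the header row.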
Poorly structured data – Usually the agency collecting the data tries to be as specific and comprehensive as possible in order to maximize the number of applications the data can have. The Department of Education, for example, breaks down student demographics by age, gender, and race for each school. Great if you’re looking to build a visualization of exactly how many female Eskimo fourth graders there are around the country, but not of much use for anything else. A researcher running a professional statistical software package can easily remap datasets like this to suit their needs. However, a web or mobile app developer doesn’t have the resources to do all the necessary calculations.
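In practice that means the app developer has to collapse the breakdown themselves before the data is usable. A toy sketch with invented numbers, summing an over-granular per-school breakdown down to plain totals:

```python
from collections import defaultdict

# Hypothetical rows in the style of a per-school demographic breakdown:
# (school, grade, gender, race, count) -- the figures are made up
rows = [
    ("PS 1", "4", "F", "White", 12),
    ("PS 1", "4", "M", "White", 14),
    ("PS 1", "5", "F", "Black", 11),
    ("PS 2", "4", "F", "White", 9),
]

# Collapse the breakdown to total enrollment per school
totals = defaultdict(int)
for school, grade, gender, race, count in rows:
    totals[school] += count

print(dict(totals))  # {'PS 1': 37, 'PS 2': 9}
```

Ten lines for one aggregation is fine; the problem is that a real app may need dozens of these remappings, each one a fresh chunk of code to write and maintain.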
Unclear data sources – Just as facts float around the internet unattributed, people download datasets and modify them all the time. Services like Survey Monkey make it incredibly easy to collect, export and distribute datasets. Crawling and scraping websites provides another channel of interesting data points. Yet there is no way of confidently tracing the data back to its source and evaluating its legitimacy.
Welcome to Exversion
As Startup Bus NYC hit the road, we started sketching out ideas to solve these problems. As developers and data junkies, we had all dealt with the frustration of projects that get canned at the idea stage because the data was right there, but it was just too expensive and time-consuming to make it work. While innovations in tech are reducing overhead in all other areas of development, public data remains locked in a torture chamber of CSV files, spreadsheets and API-less databases.
To change this Exversion provides the following:
– A built-in API to access and query any uploaded dataset.
– Data versioning to track provenance of modified datasets and keep applications up-to-date even with static data.
– The ability to fork data, add to it and modify it, and to build a reputation based on your uploads while evaluating fellow users.
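As a rough illustration of what a built-in API means for a developer, here is how querying a dataset might look. The endpoint path, dataset ID, and parameter names below are hypothetical placeholders, not Exversion’s actual API:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the real endpoint may differ
BASE = "https://exversion.com/api/v1/dataset"

def query_url(dataset_id, api_key, **filters):
    """Build a query URL against a dataset's built-in API (illustrative only)."""
    params = [("key", api_key)] + sorted(filters.items())
    return "%s/%s?%s" % (BASE, dataset_id, urlencode(params))

url = query_url("farmers-markets", "MY_KEY", state="NY")
print(url)  # https://exversion.com/api/v1/dataset/farmers-markets?key=MY_KEY&state=NY
```

The contrast with the status quo is the point: one HTTP request instead of downloading a spreadsheet, cleaning it, and importing it before the first line of app code gets written.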