Version Control for Data

From the very beginning, version control for data has been a central part of our vision. There are tools for distributing data and there are tools for versioning files, but there are no good tools for versioning data.

I say “good tools” because data versioning isn’t just about being able to modify data and keep track of the changes. One has to consider why the data is being modified in the first place, a use case that is fundamentally different from version control in code. Version control for source code is used to make changes: add features, fix bugs, refactor, and so on. Although projects may split when developers disagree over direction or philosophy, the assumption is that everything will eventually be merged back into one master branch.

Version control with data is about variety. One user needs the data broken down one way; another needs it broken down a different way. At no point will the interests of these two use cases ever merge. The benefit of tracking changes is not about getting everyone on the same page but about establishing authenticity and accountability.

My favorite example comes from a dataset of school information released by the Department of Education. It looks something like this:


<amalm09>12</amalm09>
<amalf09>7</amalf09>
<asian09>5</asian09>
<asalm09>3</asalm09>
<asalf09>2</asalf09>
<hisp09>58</hisp09>
<hialm09>35</hialm09>
<hialf09>23</hialf09>
<black09>20</black09>
<blalm09>9</blalm09>
<blalf09>11</blalf09>
<white09>159</white09>
<whalm09>76</whalm09>
<whalf09>83</whalf09>
<pacific09>0</pacific09>
<hpalm09>0</hpalm09>
<hpalf09>0</hpalf09>
<tr09>0</tr09>
<tralm09>0</tralm09>
<tralf09>0</tralf09>

I first encountered this data when I was working for a company called Temboo as their Hacker-in-Residence. Engineering couldn’t make sense of it and couldn’t find any documentation defining what all those codes meant, so they asked for my opinion on it. After a few minutes of picking through a couple of different schools I figured it out: this was student demographics broken down by grade, race, and gender. asian09 is the number of Asian ninth graders (5), asalm09 is the number of male Asian ninth graders (3).

There might be a use case where that level of granularity is needed, but I imagine that more often than not anyone who wants to use this data has to remap it to something less specific. Version control for data is about tracking that kind of activity so that analysis can be confirmed. It’s pretty easy to contaminate a dataset with faulty assumptions, bad math, or innocent inaccuracies. A clear record of changes helps data consumers identify potential issues before they build on top of the data.
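
For instance, here is a minimal sketch of that kind of remapping, assuming each school record has been parsed into a Python dict keyed by the codes above (the readable labels are guesses based on the prefixes):

school = {
    "amalm09": 12, "amalf09": 7,
    "asalm09": 3, "asalf09": 2,
    "hialm09": 35, "hialf09": 23,
    "blalm09": 9, "blalf09": 11,
    "whalm09": 76, "whalf09": 83,
    "hpalm09": 0, "hpalf09": 0,
    "tralm09": 0, "tralf09": 0,
}

# map the cryptic two-letter prefixes to readable labels (guesses)
races = {"am": "american_indian", "as": "asian", "hi": "hispanic",
         "bl": "black", "wh": "white", "hp": "pacific_islander",
         "tr": "two_or_more_races"}

# collapse the gender split: total ninth graders per race
simplified = {}
for prefix, label in races.items():
    simplified[label] = school[prefix + "alm09"] + school[prefix + "alf09"]

print simplified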

How Exversion Handles Version Control

Last time we talked about some of the things you can do to add data to repositories through our API. If you’ve read our API documentation, you know that you can also edit data through the API.

We built in logging from the very beginning, tracking additions to datasets as extensions and changes to the datasets as commits. From those records we can now generate a history for every dataset we host. In practice it looks like this:

[Screenshot: dataset history]

Anyone can see this history, but only the dataset owner can make changes.

So… let’s say someone I gave access to this dataset added some bad data. I can delete it by clicking the X button at the end of the row. Now there’s a new row under changes made to the data, and one less under data added.

But maybe that was a mistake and now I want to undo deleting that data. I can revert commits (including deletions) by again clicking the X button at the end of the row.

Now the deleted data is restored and the changes I made undone. If I wanted to, I could undo the restore, in effect re-deleting the data. Or instead I could click on the timestamp and pull up the details for this change.
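
Under the hood, you can think of each of these operations as an entry in a change log. As a purely illustrative sketch (this is a hypothetical record shape, not Exversion’s actual schema), the history above might be represented as:

# Hypothetical shape of the log behind a dataset's history page.
# Illustrative only; not Exversion's actual internal schema.
history = [
    {"type": "extension", "user": "bellmar",
     "timestamp": "2013-08-07 14:02:11", "rows_added": 20},
    {"type": "commit", "user": "bellmar", "action": "delete",
     "timestamp": "2013-08-07 15:30:45", "target_rows": [41, 42]},
    {"type": "commit", "user": "bellmar", "action": "revert",
     "timestamp": "2013-08-07 15:31:02", "target_commit": 1},
]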

Future Implementations
Eventually we’d like to allow people to create mathematical transforms that can be run directly on the platform and reproduced as new data is added. But for now we’re pretty satisfied with being able to reproduce the benefits of a version control system like git or SVN in a data environment.

Dealing with Complex Data on Exversion

Handling multidimensional or nested data structures was a big part of enabling GeoJSON support for Exversion. The traditional file upload only accepts CSV files, which is fine when your data is flat: simple columns and rows, like a basic spreadsheet.

But data within data can’t be processed the same way, and each level of complexity opens up new challenges with file formats, standardization and normalization. It seemed like it would be months before we could offer support for that type of schema.

And then we built the API and it suddenly became super easy.

Step One: Authenticating Without Writing Any Code At All

When it comes to publishing data, a lot of the people we work with are not very technical. They know how to use Python, R, or MATLAB, but they’re not prepared to develop a whole OAuth client. And yet the security OAuth offers is essential. We needed to figure out a way to serve these people while still protecting their data.

The solution was building an OAuth client directly into Exversion specifically for these use cases. If you only want access to your own account, and only your own account, click a few buttons and Exversion will issue a valid access token to sign your requests with.

Obviously, if you need to access multiple accounts, the best solution is to build your own client. Sharing access tokens among multiple parties or multiple applications negates their security benefits, but in cases where you only want to bypass the restrictions of the web upload system and get your data up on your account, this is a perfectly acceptable workaround.

Step Two: Create a Dataset

Now that you have a valid access token, you have two options for creating a dataset. The first is to simply do it through the website; the second is to send a POST request with the same information to the API.

An example in Python (using the Requests module, which you will need to install if you don’t have it already):

import requests, json

url = 'https://www.exversion.com/api/v1/dataset/create'

payload = {"access_token":"YmQ1...", "name":"api create", "description":"this was created with the api","source_url":"http://www.exversion.com", "org":0,"source_author":"Me","source_date":"August 7, 2013", "source_contact":"info@exversion.com", "private":0}

r = requests.post(url, data=json.dumps(payload))

print r.json()

The API will return a JSON response containing a string of numbers and letters that serves as an ID for the dataset.

{"status":200,"message":"Success","body":[{"dataset":"EV5XF25RH3MIMPP","sourceURL":"http:\/\/www.exversion.com","sourceAuthor":"Me","sourceDate":"August 7, 2013","sourceContact":"info@exversion.com","uploadedBy":{"id":"1","name":"bellmar"},"description":"this was created with the api","heritage":[],"size":0,"forkchanges":null,"columns":[""]}]}

You can also find the dataset id by pulling up the dataset on Exversion. It’s in the URL:

https://exversion.com/data/view/BVVYOK9DUITSH36/united-states-population-by-state-and-age

Copy it down because you’ll need that ID in the next step.
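
If you’re scripting this step, the ID can also just be split out of the URL:

# the ID sits between /view/ and the dataset name in the URL
url = 'https://exversion.com/data/view/BVVYOK9DUITSH36/united-states-population-by-state-and-age'
dataset_id = url.split('/')[5]
print dataset_id  # BVVYOK9DUITSH36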

Step Three: Push Your Data to Exversion

Now that we have a place to put the data, uploading it into the system is just another POST request.

That looks like this:

import requests, json

mydata = [{"name":"Susan", "gender":"f", "age":30},
          {"name":"Steve", "gender":"m", "age":35},
          {"name":"Frank", "gender":"m", "age":28}]

url = 'https://www.exversion.com/api/v1/dataset/push'

payload = {"access_token":"YmQ1...", "dataset":"ZV0C2R4E2MF8LBX", "data":mydata}

r = requests.post(url, data=json.dumps(payload))

If you can load your data into Python, you can push it to Exversion. For best results you may want to break the data up into multiple requests: the larger the request, the more likely it is to time out. How many items should go in each request really depends on the size of each individual item. The more columns your data has, the fewer rows should be in each request.

Here’s a simple script that uses the ijson module to stream a large file and push it to Exversion in chunks of twenty rows at a time:


from ijson import items
import urllib2, requests, json

dataset = 'ZV0C2R4E2MF8LBX'

def send_request(dataset, data):
    url = 'https://www.exversion.com/api/v1/dataset/push'
    payload = {"access_token": "YmQ1...", "dataset": dataset, "data": data}
    r = requests.post(url, data=json.dumps(payload))
    if r.status_code != 200:
        print r.text

# stream the file item by item so it never has to fit in memory at once
f = urllib2.urlopen('http://localhost:8888/world.json')
data = []
for item in items(f, 'item'):
    data.append(item)
    if len(data) == 20:
        print 'Fire request'
        send_request(dataset, data)
        data = []

# clear out the leftovers
if data:
    send_request(dataset, data)

Another thing to be aware of: Exversion will define the schema of your dataset based on the first row of data submitted to it, which means that if the first row is missing some columns, Exversion will reject any data submitted afterward with values in those columns. So make sure that the first rows of your dataset are complete before pushing.
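
A quick defensive check before pushing, sketched under the assumption that your data is a list of dicts like mydata above:

# Make sure the first row carries every column that appears anywhere
# in the data, since Exversion infers the schema from the first row.
all_columns = set()
for row in mydata:
    all_columns.update(row.keys())

missing = all_columns - set(mydata[0].keys())
if missing:
    print "First row is missing columns: %s" % ", ".join(sorted(missing))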

Step Four: Querying Multidimensionally

Through the API, nested JSON objects are easily converted into queryable data. All we need to do to have Exversion filter data by columns inside columns is specify the levels. Instead of:

{"key":"[YOUR KEY]","merge":0, "query":[{"dataset":"MXZHQGZSVH8484K", "params":{"state":"CA"}, "_limit":3}]}

We would run:

{"key":"[YOUR KEY]","merge":0, "query":[{"dataset":"MXZHQGZSVH8484K", "params":{"location.city":"Oakland"}, "_limit":3}]}

This gets at the city subcolumn within the location column.
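
Wrapped in an actual request, that looks something like the sketch below. The endpoint path here is illustrative; see our API documentation for the exact query URL.

import requests, json

# NOTE: this endpoint path is an assumption; check the API docs for
# the exact query URL.
url = 'https://www.exversion.com/api/v1/dataset/query'

query = {"key": "[YOUR KEY]", "merge": 0,
         "query": [{"dataset": "MXZHQGZSVH8484K",
                    "params": {"location.city": "Oakland"},
                    "_limit": 3}]}

r = requests.post(url, data=json.dumps(query))
print r.json()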

Caveats and Gotchas

If you’re using multidimensional querying to access GeoJSON information, remember that the API automatically paginates at 250 rows of data. You can read more about adjusting how many rows are returned at once in our API documentation.

Going geospatial with Exversion

Image by Stamen Maps

Earlier this week we gave a quick, impromptu overview of Exversion at #NYC Beta’s meetup. The majority of the talk revolved around some of the idiosyncrasies of PLUTO and MapPLUTO, and the audience, a largely geospatial crowd, wanted to know what GIS functionality, if any, we support.

Geospatial is dear to our hearts, but for the time being all API output is in JSON. However, if a dataset contains latitude/longitude or x/y coordinates, you should be able to use it with popular mapping libraries such as Leaflet and D3.js, as well as Google Maps, Bing Maps, et al., mapping those JSON objects through our API.

A sample dataset this would work with is one we featured during this year’s Publishing Hackathon, held during Book Expo America: Banned and Challenged Books.

When we run a simple search query on it, or look at the data preview on the dataset’s page, we see that it contains both latitude and longitude columns, along with other information about the challenged title, city, state, challenger, and other details.

The coordinates in the dataset allow us to load a generic JSON layer and display points on a map, as in this Publishing Hackathon example by Jackon Lin, who used the Banned and Challenged Books dataset in his visualization (displayed at the bottom of the page).
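
If you want to feed rows from our API into one of those libraries yourself, here’s a rough sketch that converts rows into a GeoJSON FeatureCollection (the latitude/longitude column names are assumptions based on the preview above):

import json

# Convert API rows with latitude/longitude columns into a GeoJSON
# FeatureCollection that Leaflet, D3.js, Google Maps, etc. can consume.
# Column names here are assumptions based on the dataset preview.
def rows_to_geojson(rows):
    features = []
    for row in rows:
        props = dict((k, v) for k, v in row.items()
                     if k not in ("latitude", "longitude"))
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point",
                         # GeoJSON coordinate order is [longitude, latitude]
                         "coordinates": [float(row["longitude"]),
                                         float(row["latitude"])]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}

print json.dumps(rows_to_geojson([{"latitude": "40.7", "longitude": "-74.0",
                                   "title": "Example Book"}]))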

While this isn’t yet a complete answer to a GIS data API, it’s a step in the right direction. As we develop Exversion further, we hope to build in geospatial functionality that will make it easy, simple, and intuitive to pull data hosted on the platform into a wide suite of geospatial data visualization tools.

In the meantime, if you build any apps, geo or otherwise, on the platform, we would love to see them. Please send your work to info @ exversion.com and we’ll try to feature as many as we can.

Now go click on that map and see what books people have tried to ban in the United States.

[Screenshot: example app]

And we’re off: Exversion is now available to everyone.

Original photo by NASA Goddard Photo and Video

We’re absolutely ecstatic to announce that today, August 7th, we’ve moved from alpha to beta, and as such, have opened the platform up to everyone.

Until today, data was stored in independent silos across the Internet and was often inaccessible. With this launch we’ve made over 40,000 datasets easily searchable from a number of sources, and we will be progressively adding additional data moving forward.

While this data is now searchable, much of it remains unusable, and we ask that the community help us clean up the world’s data. With the platform you can now upload files of up to 10MB in your browser, but more importantly you also have access to upload much larger datasets programmatically.

Every piece of NYC’s real estate data is now accessible through our API

This week we announced that the City of New York’s Primary Land Use Tax Lot Output (PLUTO) database is now machine readable. Less than a week after the City made the database publicly available, we’ve made all PLUTO data readily queryable and freely available via the Exversion API.

This means that city planners, community boards, researchers, and other people seeking commercial and residential real estate data can quickly and easily search hundreds of thousands of records.

Normally you’d have to pay the city for this data, clean it, and upload it to your own server. Now that it’s machine readable, anyone with an internet connection can instantly start deriving insight from it. We’re very excited to see what people do with this data, the types of applications they build with it, and what they’ll be able to uncover.