From the very beginning, version control for data has been an important part of our vision. There are tools for distributing data and tools for versioning files, but there are no good tools for versioning data.
I say “good tools” because data versioning isn’t just about being able to modify data and keep track of the changes. One has to consider why the data is being modified in the first place, and the answer is fundamentally different from the one for code. Version control for source code is used to make changes: add features, fix bugs, refactor, and so on. Although projects may fork when developers disagree over direction or philosophy, the assumption is that everything will eventually be merged back into one master branch.
Version control for data is about variety. One user needs the data broken down one way, another needs it broken down a different way. At no point will these two use cases ever merge. The benefit of tracking changes is not getting everyone on the same page but establishing authenticity and accountability.
My favorite example comes from a dataset of school information released by the Department of Education. It looks something like this:
<amalm09>12</amalm09>
<amalf09>7</amalf09>
<asian09>5</asian09>
<asalm09>3</asalm09>
<asalf09>2</asalf09>
<hisp09>58</hisp09>
<hialm09>35</hialm09>
<hialf09>23</hialf09>
<black09>20</black09>
<blalm09>9</blalm09>
<blalf09>11</blalf09>
<white09>159</white09>
<whalm09>76</whalm09>
<whalf09>83</whalf09>
<pacific09>0</pacific09>
<hpalm09>0</hpalm09>
<hpalf09>0</hpalf09>
<tr09>0</tr09>
<tralm09>0</tralm09>
<tralf09>0</tralf09>
I first encountered this data when I was working for a company called Temboo as their Hacker-in-Residence. Engineering couldn’t make sense of it and couldn’t find any documentation defining what all those codes meant, so they asked for my opinion. After a few minutes of picking through a couple of different schools I figured it out: this was student demographics broken down by age, race, and gender. asian09 is the number of Asian ninth graders (5); asalm09 is the number of male Asian ninth graders (3).
There might be a use case where that level of granularity is needed, but I imagine that more often than not anyone who wants to use this data has to remap it to something less specific. Version control for data is about tracking that kind of activity so that analysis can be confirmed. It’s pretty easy to contaminate a dataset with faulty assumptions, bad math, or innocent inaccuracies. A clear record of changes helps data consumers identify potential issues before they build on top of it.
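To make the remapping idea concrete, here is a minimal Python sketch that collapses the sample record above into two totals and keeps a note of exactly which fields went into each. It assumes that every field ending in "alm09" counts male students and every field ending in "alf09" counts female students, which generalizes the reading of asalm09 given above and is not confirmed by any documentation. The function name, the output field names male09 and female09, and the shape of the change record are all made up for illustration.

```python
# Field names and counts copied from the sample record above.
record = {
    "amalm09": 12, "amalf09": 7,
    "asian09": 5, "asalm09": 3, "asalf09": 2,
    "hisp09": 58, "hialm09": 35, "hialf09": 23,
    "black09": 20, "blalm09": 9, "blalf09": 11,
    "white09": 159, "whalm09": 76, "whalf09": 83,
    "pacific09": 0, "hpalm09": 0, "hpalf09": 0,
    "tr09": 0, "tralm09": 0, "tralf09": 0,
}


def remap_by_gender(rec):
    """Collapse per-group counts into two totals and describe how.

    Assumption (not from any documentation): fields ending in "alm09"
    are male counts and fields ending in "alf09" are female counts.
    """
    male_fields = sorted(k for k in rec if k.endswith("alm09"))
    female_fields = sorted(k for k in rec if k.endswith("alf09"))

    remapped = {
        "male09": sum(rec[k] for k in male_fields),
        "female09": sum(rec[k] for k in female_fields),
    }
    # The change record lists the inputs behind each output so that
    # someone else can re-run the arithmetic and confirm it.
    change = {
        "operation": "sum by gender suffix",
        "inputs": {"male09": male_fields, "female09": female_fields},
    }
    return remapped, change


remapped, change = remap_by_gender(record)
print(remapped)
print(change)
```

The change record is the part that matters here: if the suffix assumption turns out to be wrong, anyone building on the remapped numbers can see which fields were summed and trace the error back to its source.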
How Exversion Handles Version Control
Last time we talked about some of the things you could do in terms of adding data to repositories through our API. If you’ve read our API documentation, you know that you can also edit data through the API.
We built in logging from the very beginning, tracking additions to datasets as extensions and changes to the datasets as commits. From those records we can now generate a history for every dataset we host. In practice it looks like this:
Anyone can see this history, but only the dataset owner can make changes.
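For readers who think in code, the idea of a history built from extensions and commits can be sketched as an append-only log. This is not Exversion's implementation; the class, method, and field names below are hypothetical, and the sketch assumes that a deletion is itself recorded as a commit and that undoing a commit adds another commit rather than erasing anything.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    id: int
    kind: str                     # "extension" (data added) or "commit" (data changed)
    action: str                   # "add", "delete", or "revert"
    target: Optional[int] = None  # id of the entry this one acts on
    rows: list = field(default_factory=list)


class DatasetHistory:
    """Hypothetical append-only log: nothing is ever removed from it."""

    def __init__(self):
        self.entries = []

    def _append(self, kind, action, target=None, rows=None):
        entry = Entry(len(self.entries) + 1, kind, action, target, rows or [])
        self.entries.append(entry)
        return entry.id

    def _get(self, entry_id):
        return self.entries[entry_id - 1]

    def add(self, rows):
        """Record new data as an extension."""
        return self._append("extension", "add", rows=list(rows))

    def delete(self, extension_id):
        """Record the deletion of an extension as a commit."""
        if self._get(extension_id).kind != "extension":
            raise ValueError("only extensions can be deleted")
        return self._append("commit", "delete", target=extension_id)

    def revert(self, commit_id):
        """Record the undoing of a commit as another commit."""
        if self._get(commit_id).kind != "commit":
            raise ValueError("only commits can be reverted")
        return self._append("commit", "revert", target=commit_id)

    def current_rows(self):
        """Replay the log to work out which rows are live right now."""
        cancelled, deleted = set(), set()
        # Walk newest to oldest: a revert always comes after its target,
        # so by the time we reach a commit we know whether it was undone.
        for e in reversed(self.entries):
            if e.kind != "commit" or e.id in cancelled:
                continue
            if e.action == "revert":
                cancelled.add(e.target)
            elif e.action == "delete":
                deleted.add(e.target)
        return [row for e in self.entries
                if e.kind == "extension" and e.id not in deleted
                for row in e.rows]


history = DatasetHistory()
good = history.add([{"school": "A"}])
bad = history.add([{"school": "B"}])
removal = history.delete(bad)      # the bad data is no longer live
restore = history.revert(removal)  # undo the deletion
history.revert(restore)            # undo the undo: deleted again
print(history.current_rows())
```

Because the log only ever grows, every state the dataset has been in can be reconstructed, and a revert is just one more entry that can itself be reverted.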
So… let’s say someone I gave access to this dataset added some bad data. I can delete that by clicking the X button at the end of the row. Now there’s a new row under changes made to the data, and one fewer under data added.
But maybe that was a mistake and now I want to undo deleting that data. I can revert commits (including deletions) by again clicking the X button on the end of the row.
Now the deleted data is restored and the changes I made are undone. If I wanted to, I could undo the restore, in effect re-deleting the data. Or instead I could click on the timestamp and pull up the details for this change:
Future Implementations
Eventually we’d like to allow people to create mathematical transforms that can be run directly on the platform and re-run as data is added. But for now we’re pretty satisfied with being able to reproduce the benefits of a version control system like Git or SVN in a data environment.