happy [data driven] holidays from EXVERSION


This holiday season we wish you all the best. Here are a few fun facts about all things holiday.

1. The five most popular xmas carol words in order are christ, king, born, night, joy.

2. Rudolph is the most influential reindeer, having 8x the influence of Dasher, Dancer, Prancer, Vixen, Comet, Cupid, Donner and Blitzen.

3. Your best chance of a white holiday is in Northern Minnesota with a >90% chance of there being snow.

4. In 2013 Thanksgiving and Chanukah overlapped. The last time this happened was in 1918, except in Texas, where it happened in 1945 and 1956. It will theoretically happen again in 2070.

5. The most expensive hotel to spend New Year’s Eve in is the Royal Suite at the Burj Al Arab in the United Arab Emirates at $72,000 including taxes. Ouch!

All the best, and again, happy holidays,

– EXVERSION

Version Control for Data


From the very beginning version control for data has been a really important part of our vision. There are tools for distributing data; there are tools for versioning files, but there are no good tools for versioning data.

I say “good tools” because data versioning isn’t just about being able to modify data and keep track of the changes. One has to consider why the data is being modified in the first place, and that use case is fundamentally different from version control for code. Version control for source code is used to make changes: add features, fix bugs, refactor, etc. Although projects may split when disagreements between developers and philosophies pop up, the assumption is that everything will eventually be merged back into one master branch.

Version control with data is about variety. One user needs the data broken down one way, another needs it broken down a different way. At no point will the interests of these two use cases ever merge. The benefits of tracking changes are not about getting everyone on the same page but about establishing authenticity and accountability.

My favorite example comes from a dataset of school information released by the Department of Education. It looks something like this:


<amalm09>12</amalm09>
<amalf09>7</amalf09>
<asian09>5</asian09>
<asalm09>3</asalm09>
<asalf09>2</asalf09>
<hisp09>58</hisp09>
<hialm09>35</hialm09>
<hialf09>23</hialf09>
<black09>20</black09>
<blalm09>9</blalm09>
<blalf09>11</blalf09>
<white09>159</white09>
<whalm09>76</whalm09>
<whalf09>83</whalf09>
<pacific09>0</pacific09>
<hpalm09>0</hpalm09>
<hpalf09>0</hpalf09>
<tr09>0</tr09>
<tralm09>0</tralm09>
<tralf09>0</tralf09>

I first encountered this data when I was working for a company called Temboo as their Hacker-in-Residence. Engineering couldn’t make sense of it, and couldn’t find any documentation defining what all those codes meant, so they asked for my opinion on it. After a few minutes of picking through a couple of different schools I figured it out: this was student demographics broken down by grade, race, and gender. asian09 is the number of Asian ninth graders (5), and asalm09 is the number of MALE Asian ninth graders (3).

There might be a use case where that level of granularity is needed, but I imagine that more often than not anyone who wants to use this data has to remap it to something less specific. Version control for data is about tracking that kind of activity so that analysis can be confirmed. It’s pretty easy to contaminate a dataset with faulty assumptions, bad math, or innocent inaccuracies. A clear record of changes helps data consumers identify potential issues before they build on top of it.
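To make that kind of remapping concrete, here is a minimal Python sketch that collapses the race-by-gender counts into per-gender totals and keeps a record of every step. It is only an illustration: the <school> wrapper element, the PREFIXES table, and the remap_by_gender function are assumptions made for this example, and only a subset of the fields shown above is used.

```python
import xml.etree.ElementTree as ET

# A few fields from the excerpt above, wrapped in a hypothetical <school>
# element so the fragment parses as a single XML document.
RAW = """<school>
<asian09>5</asian09>
<asalm09>3</asalm09>
<asalf09>2</asalf09>
<hisp09>58</hisp09>
<hialm09>35</hialm09>
<hialf09>23</hialf09>
<black09>20</black09>
<blalm09>9</blalm09>
<blalf09>11</blalf09>
<white09>159</white09>
<whalm09>76</whalm09>
<whalf09>83</whalf09>
</school>"""

# Assumed meaning of the two-letter prefixes, inferred from the excerpt.
PREFIXES = {"as": "asian", "hi": "hispanic", "bl": "black", "wh": "white"}


def remap_by_gender(xml_text, grade="09"):
    """Collapse race-by-gender counts into per-gender totals, logging each step."""
    fields = {el.tag: int(el.text) for el in ET.fromstring(xml_text)}
    totals = {"male": 0, "female": 0}
    log = []
    for prefix in PREFIXES:
        for suffix, gender in (("alm", "male"), ("alf", "female")):
            tag = f"{prefix}{suffix}{grade}"
            totals[gender] += fields[tag]
            # The log is the point: anyone can check how each total was built.
            log.append(f"added {tag}={fields[tag]} to {gender}")
    return totals, log


totals, log = remap_by_gender(RAW)
print(totals)
for entry in log:
    print(entry)
```

The log is what a data consumer would inspect before trusting the remapped totals.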

How Exversion Handles Version Control

Last time we talked about some of the things you could do in terms of adding data to repositories through our API. If you’ve read our API documentation you know that you can also edit data through the API.

We built in logging from the very beginning, tracking additions to datasets as extensions and changes to the datasets as commits. From those records we can now generate a history for every dataset we host. In practice it looks like this:

dataset history

Anyone can see this history, but only the dataset owner can make changes.

So… let’s say someone that I gave access to this dataset added some bad data. I can delete that by clicking the X button at the end of the row. Now there’s a new row under changes made to the data, and one less under data added.

But maybe that was a mistake and now I want to undo deleting that data. I can revert commits (including deletions) by again clicking the X button on the end of the row.

Now the deleted data is restored, the changes I made undone. If I wanted to, I could undo the restore, in effect redeleting the data. Or instead I could click on the timestamp and pull up the details for this change:

Future Implementations
Eventually we’d like to allow people to create mathematical transforms that can be run directly on the platform and reproduced as data is added. But for now we’re pretty satisfied with being able to reproduce the benefits of a version control system like git or SVN in a data environment.
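For readers who think in code, the history model described in this post (additions logged as extensions, changes logged as commits, and reverts that can themselves be reverted) can be sketched in a few lines of Python. This is a toy illustration, not how Exversion is implemented: the Dataset class, its method names, and the append-only log are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    rows: dict = field(default_factory=dict)     # row id -> row data
    history: list = field(default_factory=list)  # one entry per logged change

    def _log(self, kind, row_id, before, after):
        self.history.append(
            {"kind": kind, "row": row_id, "before": before, "after": after}
        )
        return len(self.history) - 1  # the entry id is its position in the log

    def _apply(self, row_id, value):
        # A value of None means "this row does not exist".
        if value is None:
            self.rows.pop(row_id, None)
        else:
            self.rows[row_id] = value

    def extend(self, row_id, data):
        """Add new data; logged as an extension."""
        self._apply(row_id, data)
        return self._log("extension", row_id, None, data)

    def delete(self, row_id):
        """Remove data; logged as a commit so it can be undone later."""
        before = self.rows.get(row_id)
        self._apply(row_id, None)
        return self._log("commit", row_id, before, None)

    def revert(self, entry_id):
        """Undo any logged entry; the undo is itself logged and revertible."""
        entry = self.history[entry_id]
        self._apply(entry["row"], entry["before"])
        return self._log("commit", entry["row"], entry["after"], entry["before"])


ds = Dataset()
ds.extend("r1", {"title": "good row"})
ds.extend("r2", {"title": "bad row"})
deleted = ds.delete("r2")       # remove the bad data
restored = ds.revert(deleted)   # changed my mind: the row is back
ds.revert(restored)             # undo the restore: the row is gone again
print(sorted(ds.rows))
print(len(ds.history), "history entries")
```

Because every revert writes a new entry instead of erasing an old one, the log stays a complete record of who changed what, which is the accountability argument made earlier in the post.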

Going geospatial with Exversion


Image by Stamen Maps

Earlier this week we gave an impromptu and quick overview of Exversion at #NYC Beta’s meetup. The majority of the talk revolved around some of the idiosyncrasies of PLUTO and MapPLUTO, and the audience, a largely geospatial crowd, wanted to know what GIS functionality, if any, we support.

Geospatial is dear to our hearts, but for the time being all API output is in JSON. However, if the dataset contains latitude/longitude or x/y coordinates, you should be able to use it with popular mapping libraries such as Leaflet and D3.js, as well as Google Maps, Bing Maps, et al., allowing you to map those JSON objects through our API.

A sample dataset that this would work with is one we featured during this year’s Publishing Hackathon, held during Book Expo America: Banned and Challenged Books.

When we run a simple search query on it, or look at the data preview on the dataset’s page, we see that it contains both latitude and longitude columns, along with other information about the challenged title, city, state, challenger, and other details.

The coordinates in the dataset simply allow us to load a generic JSON layer and display points on a map, such as in this Publishing Hackathon example by Jackon Lin, who used the Banned and Challenged Books dataset in his visualization. *Displayed at the bottom of the page.
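As a rough sketch of what that takes in practice, the Python below turns a list of JSON records with latitude and longitude columns into a GeoJSON FeatureCollection, a format most mapping libraries can load as a point layer. The to_geojson function, the column names, and the sample records are assumptions made for illustration; they are placeholders, not rows from the actual dataset or a description of our API output.

```python
import json


def to_geojson(records, lat_key="latitude", lon_key="longitude"):
    """Convert flat JSON records into a GeoJSON FeatureCollection of points."""
    features = []
    for record in records:
        try:
            lat = float(record[lat_key])
            lon = float(record[lon_key])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        properties = {
            k: v for k, v in record.items() if k not in (lat_key, lon_key)
        }
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder records, invented for this example.
records = [
    {"title": "Example Title", "city": "Austin", "state": "TX",
     "latitude": "30.2672", "longitude": "-97.7431"},
    {"title": "No Coordinates", "city": "Unknown", "state": "NA"},
]

print(json.dumps(to_geojson(records), indent=2))
```

Rows with missing or malformed coordinates are dropped rather than plotted at a default location, which keeps bad data from showing up as misleading points on the map.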

While this isn’t yet a complete answer to a GIS data API, it’s a step in the right direction. As we develop Exversion further, we hope to build in geospatial functionality that will make it easy, simple, and intuitive to import data hosted on the platform into a wide suite of geospatial data visualization tools.

And for the time being, if you build any apps on the platform, geo or otherwise, we would love to see them. So please send your work to info @ exversion.com and we’ll try to feature as many of these as we can.

Now go click on that map and see what books people have tried to ban in the United States.


And we’re off, Exversion is now available to everyone.


original photo by NASA Goddard Photo and Video

We’re absolutely ecstatic to announce that today, August 7th, we’ve moved from alpha to beta, and as such, have opened the platform up to everyone.

Until today, data was stored in independent silos across the Internet and was often inaccessible. With this launch we’ve made over 40,000 datasets from a number of sources easily searchable, and we will keep adding data going forward.

While this data is now searchable, much of it remains unusable, and we ask that the community help us in cleaning up the world’s data. With the platform you are now able to upload files of up to 10MB in your browser, but more importantly you now also have access to upload much larger datasets programmatically.


Every piece of NYC’s real estate data is now accessible through our API

NYC PLUTO EXVERSION

This week we announced that The City of New York Primary Land Use Tax Lot Output (PLUTO) database is now machine readable. Less than a week after the City made the database publicly available, we’ve made all PLUTO data readily queryable and freely available via the Exversion API.

This means that city planners, community boards, researchers and other people seeking commercial and residential real estate data can quickly and easily search hundreds of thousands of records.

Normally you’d have to pay the city for this data, clean it, and upload it to your own server. Now that it’s machine readable, anyone with an internet connection can instantly start deriving insight from it. We’re very excited to see what people do with this data, the types of applications they build with it, and what they’ll be able to uncover.

Hack Publishing


So tomorrow we take Exversion for its first major test drive with hackers at The Alley’s Publishing Hackathon. While still in alpha and working out our fair share of bugs, we’ve preloaded a number of datasets that should be interesting to build on top of.

METADATA FOR SELECTED PERSEUS BOOKS GROUP TITLES

Organizing sponsor Perseus was originally planning on making this data available to hackers through an Excel spreadsheet. We convinced them that an API would work a little better.

QUALITATIVE LITERARY ANALYSIS

Ages ago I built a platform filled with tools that tracked reader reaction and produced analytics to help writers workshop their writing. This is a dataset I used to build out some of those algorithms. It includes the readability scores, sentence structure, and word counts of many bestsellers.

BOOK CROSSING LOCATION BASED BOOK REVIEWS

Probably the most complicated of the datasets we cleaned and loaded. This is a scrape of Bookcrossings that tags book reviews with locations and ages of reviewers. Really interested in seeing if there are any trends in what books people tend to like in certain places.

AWARD WINNING BOOKS

Data on who made the “Best of” lists for 2011 and 2012. I was surprised by how many books on the list I had never heard of … which should make it good fodder for recommendations.

And lastly… for those of you with a more mischievous edge to your hacking. I managed to dig up two data sets on banned books:

BANNED AND CHALLENGED BOOKS 1990-2012

BOOKS BANNED IN TEXAS PRISONS BY THE TDCJ

Happy Hacking 🙂

P.S. For those of you who love to hack in Rails, here’s a link to a Ruby library on GitHub that will allow you to access our API.

Welcoming you to Exversion from a highway in Georgia.

Headed for SXSW, we left New York on Sunday at approximately 8 am. Since then, we’ve seen the White House and stopped in Raleigh to catch the back end of a hackathon and give a wholly unprepared pitch to a few entrepreneurs and investors, having conceived the idea only a few hours before.

What is this madness you may think? Easy. It’s Startup Bus and we’re Exversion, a new startup founded on the road that’s launching the world’s first ever social collaboration platform for data.

What does this mean? Well, many of us who work with data, and more often than not open data, run into the same problem. You spend hours digging through countless resources, only to download a dataset that fits the project you want to work on and find that it’s either totally unusable or that you’ll spend the equivalent amount of time in Excel or Google Refine getting the data to do what you actually want to convey.