happy [data driven] holidays from EXVERSION


This holiday season we wish you all the best. Here are a few fun facts about all things holiday.

1. The five most popular Christmas carol words, in order, are: Christ, king, born, night, and joy.

2. Rudolph is the most influential reindeer, having 8x the influence of Dasher, Dancer, Prancer, Vixen, Comet, Cupid, Donner and Blitzen.

3. Your best chance of a white holiday is in northern Minnesota, where there is a >90% chance of snow.

4. In 2013 Thanksgiving and Chanukah overlapped. The last time this happened was in 1918 (except in Texas, where it also happened in 1945 and 1956). It will theoretically happen again in 2070.

5. The most expensive hotel to spend New Year’s Eve in is the Royal Suite at the Burj al Arab in the United Arab Emirates, at $72,000 including taxes. Ouch!

All the best, and again, happy holidays,

– EXVERSION

Feasting on Data

A year ago I heard about a little hackathon being held in the middle of a high-powered social good conference called The Feast. Hackers who signed up got full access to the conference and its many perks. It sounded like a good deal to me, so I signed up.

The challenge of The Feast hackathon is building things with data. At the time I was just coming to terms with the gap between the possibilities of data and the reality. Participating helped me articulate everything that bothered me about the current data infrastructure. Instead of doing a traditional hackathon demo, I ended up using my time on the main conference stage giving a mini lesson in why hackathon ideas never become real-life solutions (and also a tutorial on brute-force hacking for newbies).

(my part starts at 7:44)

A big target for me in that speech was The Khan Academy‘s API, which gives you full access to all the Khan Academy’s educational resources … with one catch.

There’s no good way to search for content. So in order to get what you need, you have to know exactly what you need and exactly where it is. Oh sure, there’s a topic tree, but you can’t query it. So in order to get that information you have to download and parse the entire JSON file … all 30MB of it.

This year Exversion was approached by The Feast to provide data for the hackathon, so I set about fixing the Khan Academy problem. It took me some time to figure out how best to untangle a tree when each branch was an infinite number of levels deep, but once I did I was able to produce a query-able directory of The Khan Academy’s content. Why should you have to sort through all the economics content when what you want is videos on basic addition? Now you can grab your Exversion API key and do this:

https://exversion.com/api/v1/dataset/3QC5N5HXIGSJK3Z?key=xxx&title=addition

You can even use complex queries to search the videos themselves. Check out our API documentation for more details.
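For instance, here’s a minimal sketch in Python of hitting that endpoint with the requests library. The dataset ID and the title parameter come from the example URL above; substitute your own API key, and check the API documentation for the exact response shape.

import requests

# Minimal sketch: query the Khan Academy directory on Exversion for
# content with "addition" in the title. The dataset ID and parameter
# names mirror the example URL above; swap in your own API key.
resp = requests.get(
    "https://exversion.com/api/v1/dataset/3QC5N5HXIGSJK3Z",
    params={"key": "YOUR_API_KEY", "title": "addition"},
)
resp.raise_for_status()
results = resp.json()  # see the API docs for the exact response structure
print(results)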

Differential Diagnosis in Production: The Outcome


(At last the conclusion to Differential Diagnosis in Production. Could you figure out what was wrong with the server? Here’s the answer…)

As panic began to set in, I remembered something useful. Some commands on Symfony’s console are pickier than others. They will not execute if there’s an error, even if that error is something small in a file that has nothing to do with what the command should be doing. In the past I found this strictness super annoying, but now I realized it could be useful. If the server had stopped writing errors to the logs, then perhaps I could force it to write the errors to standard output instead. It was a long shot, but I didn’t seem to have any other options.

And it worked.

I chose php app/console doctrine:schema:update --force, which checks that the database matches the ORM schema outlined in the code. Since there were no updates, the script should have run and reported back that everything was up to date. But instead, out came a big red error block.

Son of a– We were out of disk space.

“That’s impossible,” Jacek said. “We have like 80GBs.”

And yet pulling up the disk usage confirmed it: the web root was using over 75GB of space. Suddenly everything made sense. The problem had nothing to do with Symfony’s cache, but clearing the cache opened up enough space to allow Exversion to operate. And once the server didn’t even have enough space to write to the logs, deleting the cache stopped fixing the problem.

So one question remained: Symfony doesn’t use that much space. Our entire development environment, all components of it, is only about 2GB. What was eating up all that space?

In a word: downloads.

How Do People Want Their Data?
Bulk data downloads were never part of our ultimate vision, but we had to provide them in order to give people the ability to branch and remap the data they wanted. It was an inefficient solution, but until we finished the features that allowed people to edit their data through Exversion directly (and we’re almost there) we had to offer it.

What we didn’t anticipate is how popular downloading data would be. And why would we? All of our competitors focus on serving files. They are older, more established, larger companies with greater visibility. Everyone who comes to Exversion has had experience with them.

So we assumed that what people wanted from us were features that our competition didn’t offer. After all, why download a file through us when you can go directly to the source and download it there?

Part of the answer is that people clean data found on those other sites and upload it to Exversion because they can’t upload it to our competitors. Another reason is that we index all those other sites for unified search. There are some who would say no one needs data search because Google does such a good job indexing everything … and those people have clearly never actually tried to find data through Google. It’s hard. It’s really freaking hard. We have three upcoming events where we’ve been given the rank and privileges of sponsors just for helping the organizers find data.

But back to our technical troubles: Exversion doesn’t host files. Downloads are generated dynamically based on the current state of the data. To keep the load on the database down, we don’t delete the generated files right away, but clear them out on a regular basis. All that was working fine. We just didn’t anticipate how popular downloading would be. To put it simply: we weren’t clearing out downloads often enough.
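The cleanup itself is nothing exotic. As a rough sketch (the directory path and age threshold here are made up for the example), the job amounts to something like this in Python:

import os
import time

DOWNLOAD_DIR = "/var/www/exversion/downloads"  # hypothetical path for the example
MAX_AGE_SECONDS = 6 * 60 * 60                  # keep generated files for six hours

now = time.time()
for name in os.listdir(DOWNLOAD_DIR):
    path = os.path.join(DOWNLOAD_DIR, name)
    # Delete generated download files older than the threshold.
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)

Run that from cron frequently enough and the folder can’t quietly eat the disk. Our mistake was simply running it too rarely for the traffic we were getting.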

Earthquakes, Drones and Gun Violence
Once we realized what the problem was, our first assumption was that this was all about PLUTO. PLUTO (for those that don’t know) is a massive collection of real estate data split into two parts: metadata and shape files. We released PLUTO, including a GeoJSON version of those shape files, about a month ago. Obviously, downloads were eating up space faster because people were downloading these much larger files.

….Nope.

This is what people have been downloading

There were over 4,000 downloads between the regular clearing of files and the crash. And hundreds more have been downloaded since Labor Day. The most popular of the MapPLUTO datasets was Queens, followed by Staten Island of all things. Manhattan is the least popular.

Then there are the earthquake datasets, which are overwhelmingly popular both on Exversion and on Data.gov where they originate. I have no idea what people use them for.

Do We Change What We’re Doing?
A month ago I wrote a somewhat controversial post about cognitive biases in tech. Some interpreted it as a criticism of Paul Graham and YC, but if it was, it was even more so a criticism of myself. The post spent much more time detailing false assumptions that I had made … and apparently continue to make.

Exversion had a system for reporting the number of downloads. There was a bug in it, which was why we didn’t notice the escalating level of activity until it took us down. I actually had noticed the bug months before all this, but I did not make fixing it a priority because it confirmed what I already wanted to believe about what we were doing. In other words, fixing the download counter seemed less important than handling other things because I assumed that users weren’t as interested in downloading data as they were in other features … but as long as the download counter was broken we couldn’t collect the data that proved me wrong. That obvious fact never occurred to me … which is ironic really.

So, once discovered, the problem was easy to fix: I cleared out the downloads, rewrote the process of generating downloads to be more resource-efficient, fixed the download counter (adding some extra logging of that activity to boot) and tweaked the cron job that clears downloads to clean up that folder more often. For the time being we’ve disabled downloading of our much larger PLUTO files, because we have to figure out how this changes things for us.

I built the download system based on the idea that it was a temporary solution that would ultimately be phased out. It was neither designed nor intended for the type of usage it’s seeing. I’m not sure what its performance is like on those large repositories, to be honest.

And since we are getting a lot of activity in this area the question remains: do we need to rethink our philosophy about the future of data? Should we make downloads more of a priority and redirect resources to building better infrastructure there? Plenty in Silicon Valley would say “Yes, absolutely”, but if the world were run by venture capitalists we would have more apps for Miley Cyrus to humiliate herself on and no roads. Building infrastructure is an unglamorous thing. Its value can’t be gauged with public opinion, because its value doesn’t appear until someone builds something on top of it.

Lately we’ve been having more and more meetings with NGOs looking for a way to share and collaborate on data. The feedback we’ve gotten from them suggests we’re on the right track. There’s a need that no one is serving that we are uniquely positioned to fill … and once implemented the solution could be truly game-changing.

So we’ll see.

Differential Diagnosis in Production


This week we celebrated Labor Day by giving our web server some time off 🙂 Why should the relaxing qualities of holidays be reserved for conscious life forms only? One day the machines will rule us all, might as well start making nice now.

But seriously, the process of diagnosing and debugging a critical failure is not something often talked about, yet every technologist will have to deal with it eventually. Even if every piece of code you push is absolutely perfect, modern day web programming has too many moving parts … you can’t be an expert on all of them.

And when your service is down, the clock is ticking, and the problem appears to be buried deep in something beyond your expertise, everything depends on your resourcefulness. So, yes, this weekend Exversion suffered a critical failure that took the site offline for about twenty minutes on Sunday and about two hours on Monday, but the process of tracking and eliminating the problem turned out to be pretty interesting.

Early Symptoms
It started with an email from Lee Byron. Mixed in with some general feedback was an offhand comment about difficulty logging in. The login seemed to be “broken” but no other details were provided.

Neither Jacek nor I was able to reproduce the issue, nor were any of our other users complaining. I emailed Lee back anyway, asking for more information about the error he got (if any). At this point there seemed to be two likely explanations:

  • Some strange cross-browser issue.
  • He hadn’t activated his account by verifying his email address.

When I didn’t hear back I assumed the issue had been “user error” and took it off my agenda.

Locked Out
A few days later, on Saturday morning, I was giving a demo to a prospective client on his computer. I logged in, only it didn’t take. No errors, everything appeared to work, but the “Login/Sign Up” buttons were still on the top nav and the members-only features weren’t displaying.

“Do you have any kind of privacy, anonymous surfing software installed on this machine?” I asked.

“No, not that I know of.”

Exversion is built on Symfony2, using Friends of Symfony’s UserBundle (with some custom tweaks to the invitation recipe to suit our purposes) for User Management. By default this stores sessions in Symfony’s cache.

So the first thing I did when I encountered this problem myself was SSH into the server and clear the cache.

I could then log in without any problems, and the rest of the meeting went really well. But obviously this wasn’t a real solution. Why wasn’t Symfony creating user sessions correctly? Was the problem with Symfony itself or FOSUserBundle? I started googling and digging through GitHub issue reports, trying to find a clue … but there was nothing.

Original Diagnosis
There also wasn’t anything useful in either Symfony’s logs or Apache’s, so the best explanation I could come up with on virtually no information was that old sessions weren’t being cleared away and were filling up the cache. The solution seemed to be to move sessions out of the cache itself, something I had been meaning to do for a while anyway because I push code to Exversion every day and every time I clear the cache it logs everyone out.

But I wasn’t familiar with the procedure for moving the sessions folder and I wanted to read up on it and run some tests before going through with it. Like most, I prefer to make major changes to the infrastructure somewhere around 3am~4am so that if something goes wrong (God forbid) the impact on users is minimal.

And unfortunately I had to be up early on Sunday to head out to my parents’. I hadn’t visited my family in a good two months … pretty bad seeing as they live basically an hour away. So I figured I would keep an eye on the server and leave the permanent fix until I was back in the office on Tuesday.

Offline (part one)
I was in the middle of having a mojito with my mother and looking at pictures of their cruise to Alaska when I got the email from Jacek:

Not to ruin your weekend, but can’t login and can’t register on the site :/

login redirects to splash page, and register gives me a 503. :/

Effin’ hell … already? “Okay,” I texted back. “It’s a problem with the cache, I’ll fix it. Don’t worry.”

I SSHed into the server and cleared the cache. Then I reloaded Exversion.com expecting to be able to login and go back to my drink. Only what I got instead was a 500 error … and not one of our pretty custom ones either.

The site was dead.

HOLY CRAP, Fix It Now!
I spent the better part of the next twenty minutes texting back and forth with Jacek. He had limited wifi at his apartment and couldn’t get into the server, but was able to confirm that the server itself was still running. The worst part of dealing with this sort of thing is that Exversion is at present a two-person team. I write the code and design most of the infrastructure; Jacek maintains the servers and does the UX. When something unexpected has gone wrong, the best asset is another perspective or (barring that) the ability to step away from the problem and clear your mind. But we needed to get back online ASAP, and with only two of us, fresh perspectives and breaks were in short supply.

I went through the usual rounds of magic tricks: I restarted Apache, I tried clearing the cache again. Symfony’s logs now had some useful errors in them announcing that Twig could not write to the cache folder.

99.9999% of the time this means the permissions haven’t been set correctly. The cache folder is owned by the root user, but Apache needs to be able to write to it.

And this made absolutely no sense because the permissions had not been changed and Exversion had been functioning correctly with those settings for months. How could a folder change its own permissions magically? It couldn’t.

But I didn’t have any better ideas, so I tried resetting the permissions. No luck. Examined the subfolders to make sure the command had executed recursively. No dice. Finally I just got so frustrated I deleted the entire cache folder, recreated it, reset its permissions and reloaded the page.

Bingo. We were back up.

At this point let me stop and quickly explain something for people not familiar with Symfony. Symfony has a neat set of command line functions that help with a variety of administrative tasks. The cache is separated out into multiple folders for each environment (although most applications will only have two: development and production). The command to clear the cache requires you to specify which environment to clear, and it won’t touch the other folders in the cache. If you specify no environment, it will clear development by default.

In addition, Symfony rebuilds its cache when it detects something missing. So deleting the entire cache directory and recreating it wasn’t as extreme as it might sound. Symfony would rebuild the folder structure it needed automatically.

Offline (part two)
I ran a few tests, confirmed the site was back online and back to fully functioning, then I logged on to Exversion’s official Twitter account and announced that we were working through a problem with Symfony’s cache and to let us know about any errors.

I went back to Google. It no longer seemed likely that this was a problem with user sessions. Could it be some bug in an update to Symfony core? Could it be an Apache issue? I was having a hard time identifying anything useful and yet I knew that it was only a matter of time before the problem struck again.

It happened at about 3:30 the next afternoon.

I confirmed the issue myself and tried deleting the cache again, hoping another rebuild would work the same magic.

No luck.

The whole site was down again. Now not even the front page would load. To make matters worse, the logs were not updating either. The server log was cut off mid-line, which I had never seen before, while Symfony’s log wasn’t updating with any errors.

I was flying blind.

About a month ago, just before we came out of alpha, Swift Mailer stopped sending emails. Since we were in private alpha, it wasn’t a major tragedy, but– like now– there weren’t any errors being thrown. I spent the whole day working through every component of our email system, until I figured out it was a bad update. I tend not to expect large, widely used open source projects to fail … but as soon as I updated the package the problem was solved.

So now with no other options, I ran a composer update and prayed.

Nothing.

Exversion was broken and I had no idea what was wrong or how to fix it.

(To Be Continued. Part Two with the solution will be posted on Sunday. For now, take your best guess. What troubles this ailing server?)

Version Control for Data


From the very beginning version control for data has been a really important part of our vision. There are tools for distributing data; there are tools for versioning files, but there are no good tools for versioning data.

I say “good tools” because data versioning isn’t just about being able to modify data and keep track of the changes. One has to consider why the data is being modified in the first place, a use case that is fundamentally different from version control in code. Version control for source code is used to make changes: add features, fix bugs, refactor, and so on. Although projects may split when disagreements between developers and philosophies pop up, the assumption is that everything will eventually be merged back into one master branch.

Version control with data is about variety. One user needs the data broken down one way, another needs it broken down a different way. At no point will the interests of these two use cases ever merge; the benefit of tracking changes is not about getting everyone on the same page but about establishing authenticity and accountability.

My favorite example comes from a dataset of school information released by the Department of Education. It looks something like this:


<amalm09>12</amalm09>
<amalf09>7</amalf09>
<asian09>5</asian09>
<asalm09>3</asalm09>
<asalf09>2</asalf09>
<hisp09>58</hisp09>
<hialm09>35</hialm09>
<hialf09>23</hialf09>
<black09>20</black09>
<blalm09>9</blalm09>
<blalf09>11</blalf09>
<white09>159</white09>
<whalm09>76</whalm09>
<whalf09>83</whalf09>
<pacific09>0</pacific09>
<hpalm09>0</hpalm09>
<hpalf09>0</hpalf09>
<tr09>0</tr09>
<tralm09>0</tralm09>
<tralf09>0</tralf09>

I first encountered this data when I was working for a company called Temboo as their Hacker-in-Residence. Engineering couldn’t make sense of it, and couldn’t find any documentation defining what all those codes meant, so they asked for my opinion on it. After a few minutes of picking through a couple of different schools I figured it out: this was student demographics broken down by grade, race, and gender. asian09 is the number of Asian ninth graders (5), asalm09 is the number of male Asian ninth graders (3).

There might be a use case where that level of granularity is needed, but I imagine that more often than not anyone who wants to use this data has to remap it to something less specific. Version control for data is about tracking that kind of activity so that analysis can be confirmed. It’s pretty easy to contaminate a dataset with faulty assumptions, bad math, or innocent inaccuracies. A clear record of changes helps data consumers identify potential issues before they build on top of it.
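To make the remapping idea concrete, here’s a rough sketch in Python using the fields from the snippet above. The record is abbreviated and the parsing simplified; it just collapses the coded ninth-grade columns into per-race totals, dropping the gender split.

import xml.etree.ElementTree as ET

# Collapse the coded ninth-grade demographic fields into per-race totals.
# Field names come from the sample record above; real files may differ.
record = """<school>
<asian09>5</asian09><asalm09>3</asalm09><asalf09>2</asalf09>
<hisp09>58</hisp09><black09>20</black09><white09>159</white09>
</school>"""

codes = {"asian": "asian09", "hispanic": "hisp09", "black": "black09", "white": "white09"}

root = ET.fromstring(record)
remapped = {label: int(root.findtext(code)) for label, code in codes.items()}
print(remapped)  # {'asian': 5, 'hispanic': 58, 'black': 20, 'white': 159}

A history of exactly this kind of transformation is what lets someone downstream check whether the remapped numbers still add up to the originals.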

How Exversion Handles Version Control

Last time we talked about some of the things you can do in terms of adding data to repositories through our API. If you’ve read our API documentation, you know that you can edit data through the API as well.

We built in logging from the very beginning, tracking additions to datasets as extensions and changes to the datasets as commits. From those records we can now generate a history for every dataset we host. In practice it looks like this:

(Screenshot: dataset history)

Anyone can see this history, but only the dataset owner can make changes.

So… let’s say someone I’d given access to this dataset added some bad data. I can delete that by clicking the X button at the end of the row. Now there’s a new row under changes made to the data, and one less under data added.

But maybe that was a mistake and now I want to undo deleting that data. I can revert commits (including deletions) by again clicking the X button on the end of the row.

Now the deleted data is restored, the changes I made undone. If I wanted to, I could undo the restore, in effect re-deleting the data. Or instead I could click on the timestamp and pull up the details for this change.
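Under the hood, each row in that history is just a logged change record. Purely as an illustration (none of these field names come from our actual schema), such a record might hold something like:

from dataclasses import dataclass
from datetime import datetime

# Hypothetical shape of one history entry, invented for illustration only.
@dataclass
class ChangeRecord:
    dataset_id: str
    kind: str               # "extension" (data added) or "commit" (data changed/deleted)
    rows_affected: int
    author: str
    timestamp: datetime
    reverted: bool = False  # flipped when the change is undone from the history view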

Future Implementations
Eventually we’d like to allow people to create mathematical transforms that can be run directly on the platform and reproduced as data is added. But for now we’re pretty satisfied with being able to reproduce the benefits of a version control system like Git or SVN in a data environment.

Going geospatial with Exversion


Image by Stamen Maps

Earlier this week we gave an impromptu and quick overview of Exversion at #NYC Beta‘s meetup. The majority of the talk revolved around some of the idiosyncrasies of PLUTO and MapPLUTO, and the audience, a largely geospatial crowd, wanted to know what GIS functionality, if any, we support.

While all we can say is that geospatial is dear to our hearts, for the time being all API output is in JSON. However, if the dataset contains latitude/longitude or x/y coordinates, you should be able to use it with popular mapping libraries such as Leaflet and D3.js, as well as Google Maps, Bing Maps, et al., allowing you to map those JSON objects through our API.

A sample dataset that this would work with is one we featured during this year’s Publishing Hackathon, held during Book Expo America: Banned and Challenged Books.

When we run a simple search query on it, or look at the data preview on the dataset’s page, we see that it contains both latitude and longitude columns, along with other information about the challenged title, city, state, challenger, and other details.

The coordinates in the dataset simply allow us to load a generic JSON layer and display points on a map, such as in this Publishing Hackathon example by Jackon Lin, who used the Banned and Challenged Books dataset in his visualization (displayed at the bottom of the page).
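As a rough sketch of what that takes, here’s one way in Python to turn rows with latitude/longitude columns into a GeoJSON FeatureCollection that libraries like Leaflet can load directly. The dataset ID, column names, and response shape are assumptions for the example.

import json
import requests

# Hypothetical example: fetch rows from an Exversion dataset and convert
# records with latitude/longitude columns into a GeoJSON FeatureCollection.
resp = requests.get(
    "https://exversion.com/api/v1/dataset/DATASET_ID",
    params={"key": "YOUR_API_KEY"},
)
rows = resp.json()  # assumed to be a list of dicts with "latitude"/"longitude" keys

features = [
    {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [float(row["longitude"]), float(row["latitude"])],
        },
        "properties": {k: v for k, v in row.items() if k not in ("latitude", "longitude")},
    }
    for row in rows
]

with open("banned_books.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)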

While this for the time being isn’t a complete answer to a GIS data API, it’s a step in the right direction, and as we develop Exversion further, we hope to build in geospatial functionality that will make it easy, simple, and intuitive to import data hosted on the platform into a wide suite of geospatial data visualization tools.

And for the time being, if you build any apps, geo or otherwise, on the platform, we would love to see them. So please send your work to info @ exversion.com and we’ll try to feature as many of these as we can.

Now go click on that map and see what books people have tried to ban in the United States.


API Wrapper for Machine-Readable Data

As we crawl and index the world’s data silos, we’ve noticed that some have APIs. To that end, we’ve gone ahead and created an API wrapper for those data sets that are already accessible in a machine-readable format.

The benefit of this is that you’re now able to build on top of both NYC data and Chicago data through the Exversion API without having to go to either city’s open data site.


While the functionality is there, some problems still persist, primarily that we’re limited to the API functionality of the host website, so some API functions may not be available.

For example, while you can query a data set housed on Exversion to give you all results that have a single word in a column, the same cannot be said for external data we’ve wrapped from places such as NYC, San Francisco, or Chicago, where you’ll need to query the full text.

This is why we urge you to download the data set and upload it to the platform to take full advantage of our API. You can do this easily by clicking the Menu box, selecting Download and then CSV, and then clicking the green Upload button. When you do this, the meta, description, and data attribution fields will be automatically populated. Just add a blurb describing the changes (if any), and upload the file.

For the above data set, the City of Chicago’s Christmas Tree Drop-Off Locations, the entire process took approximately a minute, making the data readily consumable by anyone on the site. And because it’s now hosted, when you or anyone else searches for the data on Exversion, it will appear at the top of your search results.

 

You can now search over 20k data sets on Exversion.


At the end of June we added metadata for over 20,000 data.gov data sets on Exversion, and are in the process of adding thousands of other data sets that are housed on CKAN installations throughout the world.

However, the fundamental problem with CKAN metadata aggregation is that the CKAN API will not let you query against the actual data set; instead, it only lets you query what types of data sets exist on each installation of the software, which makes the platform useless in terms of machine-readable data.
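To illustrate the distinction, here’s a rough sketch in Python against data.gov’s CKAN endpoint (the query term is just an example): the API can search dataset metadata, but the rows themselves live in whatever files the resources link to.

import requests

# CKAN's action API searches dataset *metadata*, not the data inside the sets.
# This hits data.gov's CKAN catalog; "earthquake" is just an example query.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "earthquake", "rows": 5},
)
for pkg in resp.json()["result"]["results"]:
    print(pkg["title"])
    # Each resource is typically a link to a static file hosted elsewhere.
    for res in pkg.get("resources", []):
        print("  ", res.get("format"), res.get("url"))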

This model distributes static .csv files along with secondary and tertiary links to off-site data, making it very difficult to aggregate much of anything aside from the links and metadata. As such, we’re asking you, the crowd, to help populate these data sets. In order to foster this process we’ve provided a simple applet that will allow you to upload a specific data set.


As an example, I searched for Crash Statistics by state on Exversion, but the data set had yet to be imported; namely, it needed to be cleaned. Following our style guide, I quickly made the dataset “Exversion ready”, i.e. a CSV file, and uploaded it to the site with a description of the changes. The dataset is now available here.

While this is not the perfect solution to the larger problem of not having easily accessible machine-readable data, it allows the data community to come together and help make data that was previously inaccessible machine readable.

At the same time, while this is the status quo for data housed on or linked to from CKAN installations, we’re working on a few projects that should wholly integrate data housed on other platforms.

If you guys have any questions, ask them in the comments or feel free to write us at info @ exversion.com.

The Sad State of Open Data


While the wealth of applications built on top of open data is ever increasing, a fundamental problem persists in surfing open data websites. Simply put, the entire process is nothing short of a sad affair.

For example, take a complete listing of every farmer’s market in the United States that includes the types of produce one can typically find there, or demographics on the Marital Status of Active Military Personnel. The variety is simply astounding, and while these are only two examples, there is just so much data out there. However, getting to those bits of good data can be quite a pain.

People who love data see how gleaning insight from even static sets can help us hack everything from public policy to our daily commute. Unfortunately, in order for developers to build these tools it’s not enough for data to be open and public; it also has to be accessible, and frequently it isn’t.

Welcoming you to Exversion from a highway in Georgia.


Headed for SXSW, we left New York on Sunday at approximately 8 am. Since then, we’ve seen the White House and stopped in Raleigh to catch the back end of a hackathon and give a wholly unprepared pitch to a few entrepreneurs and investors, having conceived the idea only a few hours before.

What is this madness, you may think? Easy. It’s Startup Bus, and we’re Exversion, a new startup founded on the road that’s launching the world’s first-ever social collaboration platform for data.

What’s this mean? Well, for the many of us who work with data, and more often than not open data, we all run into the same problem. We spend hours digging through countless resources, only to download a dataset that fits the project we want to work on, and find that it’s either totally unusable or that we’ll spend an equivalent amount of time in Excel or Google Refine getting the data to say what we actually want to convey.