First let me say that I love Github. It is a hacker’s paradise and the perfect platform for what it is actually designed to do: share code. But lately Github has started pushing the idea that Github is an appropriate platform for everything from novels, to tutorials, to datasets. And while I’ve seen some truly brilliant ways of arranging a repository to do chapter-by-chapter instruction, the problem with releasing data on Github is that the same structures that make Github the most efficient solution for hosting code make it a frustrating and inefficient solution for hosting data.
Nevertheless, people do just dump data on Github and hope that others will actually use it. Worse, more and more people are cleaning the same data in exactly the same way and dumping it on the same platform. If they could have found the data on Github in the first place, they could have spent that time building something else.
The question for us became how do we turn this into an advantage?
Time to come clean: I did not originally build the ability to import from Github for users. I built it as part of the admin dashboard.
I kept finding interesting data lost and ignored on Github. I kept downloading these files, creating data repos for them and uploading them to Exversion. It was satisfying… a bit like a treasure hunt, but it took up a lot of time. As a hacker, when you do the same thing enough times, knowing you will do it many more times in the future, you begin to think seriously about automating it.
The Great Scavenger Hunt: Finding Data on Github
Unless you’re linked to it directly or know the organization or person releasing it, finding data on Github is a pain in the ass. Github search does the reasonable thing and weights its results by repo activity, but the overwhelming majority of the community is interested in code, not data. If you’re searching for something like Ebola data, the right repo pops up immediately. But if you want something like flight data, most of what comes up on Github is apps and pet projects where the word “flight” appears in the title or the description.
Github does let you search by filetype, which is useful, but it assumes you want to query inside files. In other words, the query “flights extension:csv” returns CSV files with the word “flights” in them (or in the file name), not repositories that match “flights” and contain CSV files. And you cannot run a filetype filter without a search query.
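To make the limitation concrete, here is a minimal sketch of that kind of search against Github’s code-search API, using only the standard library. The “flights extension:csv” query comes from the text; the function names and result handling are illustrative assumptions, not part of the original workflow.

```python
import json
import urllib.parse
import urllib.request

def build_code_search_url(keyword, extension):
    """Build a Github code-search URL for a query like 'flights extension:csv'."""
    query = f"{keyword} extension:{extension}"
    return "https://api.github.com/search/code?q=" + urllib.parse.quote(query)

def search_csv_repos(keyword, token):
    # Github's code search requires an authenticated request.
    req = urllib.request.Request(
        build_code_search_url(keyword, "csv"),
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)
    # Note: each hit points at a *file*, not a repository -- exactly the
    # mismatch described above. We have to walk back up to the repo.
    return {item["repository"]["full_name"] for item in results["items"]}
```

Even after walking each file hit back up to its repository, the ranking is still file-match ranking, which is why this alone never solved the discovery problem.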
So once again: even if Github were the perfect solution for hosting data (it is not), it can be very difficult to find the data that’s up there. We can’t harvest data from Github if we can’t find it on Github. This was our first problem.
Luckily there is a service that can search Github and find all the CSV files in public repositories. It can even filter by time period, so that every day we have a timely listing of new data to steal.
It’s called Google 🙂
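The post doesn’t spell out the exact query, but a daily sweep along these lines can be sketched with Google’s `site:` and `filetype:` operators plus its time-restrict parameter. The operator combination here is our guess at the approach, not a documented Exversion internal:

```python
import urllib.parse

def google_data_hunt_url(keyword):
    # Restrict results to github.com and to CSV files.
    query = f"site:github.com filetype:csv {keyword}"
    # `tbs=qdr:d` asks Google for results from the past day, which is
    # what gives the daily listing of fresh data mentioned above.
    return "https://www.google.com/search?" + urllib.parse.urlencode(
        {"q": query, "tbs": "qdr:d"}
    )
```

Running the resulting URL once a day (by hand or from a cron job) yields a feed of newly indexed CSV files that Github’s own search cannot produce.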
Link Building Through Pull Request
Once I knew where the files I wanted were, importing was pretty easy. Github follows nice, orderly, predictable URL patterns. I could download the raw CSV file, reuse the repo’s metadata from Github’s API, and put the whole thing in the queue for Exversion with just the click of a button. But I wanted more. I wanted some way to reach out to the people struggling to use Github as a data-sharing solution and let them know that we exist.
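The import step described above can be sketched like this. The raw-file and repos-API URL patterns are Github’s real ones; `fetch_for_import` and the returned dict shape are hypothetical stand-ins for Exversion’s internal queue, which obviously isn’t shown here:

```python
import json
import urllib.request

def raw_file_url(owner, repo, branch, path):
    # Github serves raw file contents from a predictable URL pattern.
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"

def repo_metadata_url(owner, repo):
    # The repos API returns the name, description, and other metadata
    # we can reuse for the mirrored dataset.
    return f"https://api.github.com/repos/{owner}/{repo}"

def fetch_for_import(owner, repo, branch, path):
    csv_bytes = urllib.request.urlopen(
        raw_file_url(owner, repo, branch, path)).read()
    with urllib.request.urlopen(repo_metadata_url(owner, repo)) as resp:
        meta = json.load(resp)
    # Bundle the file with the repo's own metadata for the import queue.
    return {
        "data": csv_bytes,
        "name": meta["name"],
        "description": meta.get("description") or "",
    }
```

Because both URLs are derivable from the repository coordinates alone, the whole import reduces to filling in four strings, which is what made the one-click button possible.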
So once I confirmed the data had been imported correctly, I automated the process of forking the original repo, editing the README.md file to add a link back to the “mirror” on Exversion, and committing the change back to Github.
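The fork-and-edit step can be sketched against Github’s v3 REST API. The fork, contents-read, and contents-update endpoints are real; the commit message, link format, and function names are illustrative assumptions:

```python
import base64
import json
import urllib.request

API = "https://api.github.com"

def add_mirror_link(readme_text, mirror_url):
    """Append a link back to the Exversion mirror (pure text edit)."""
    return readme_text + f"\n\n[Mirrored on Exversion]({mirror_url})\n"

def github_request(method, url, token, payload=None):
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fork_and_link(owner, repo, my_user, mirror_url, token):
    # 1. Fork the original repository into our own account.
    #    (Forking is asynchronous; a real version would poll until ready.)
    github_request("POST", f"{API}/repos/{owner}/{repo}/forks", token)
    # 2. Read the fork's README; contents come back base64-encoded.
    readme = github_request(
        "GET", f"{API}/repos/{my_user}/{repo}/contents/README.md", token)
    text = base64.b64decode(readme["content"]).decode("utf-8")
    # 3. Commit the edited README back; the file's current sha is required.
    github_request(
        "PUT", f"{API}/repos/{my_user}/{repo}/contents/README.md", token,
        {"message": "Add link to Exversion mirror",
         "content": base64.b64encode(
             add_mirror_link(text, mirror_url).encode()).decode(),
         "sha": readme["sha"]})
```

Everything up to and including the commit on the fork could run unattended; only the final pull request back to the original repo could not.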
Let me tell you, whoever designed Github’s API is very smart, because there is one thing you cannot do through the API: create a pull request across forks. Working on this project made me realize how amazing it is that Github has not had the spam issues of other large, social websites. I suppose it would not be unfair to call what I was doing the world’s first pull request spam (then again, a lot of weird and wonderful things go on via pull request), and I do feel a bit … dirty about it. Like, sure, it’s incredibly evil, that doesn’t bother me … but it … mmm … inches a bit too close to an ethical line here.
At the same time, almost everyone I’ve contacted through a pull request has been incredibly cool. Because Github prevented me from automating this last step, I had to take the time to write a short message explaining that I’d found the data interesting enough to mirror on a site where it could be accessed via API. Most people putting data up on Github understand the convenience of having an API and were more than happy to accept my pull request.
It was a simple win-win outreach strategy: even if the user never checked out Exversion to see what we offered over Github, we still got a nice link back from a high-quality domain. And if the user accepted our pull request? That link became even more valuable!
Github for Data
One of our most important goals in developing Exversion is trying to bring together a community of data enthusiasts. There is no gathering place for people who love data: we cross too many age groups, industries, and technical skill levels. But what sites like Github ultimately prove is how communication and collaboration within a community can incubate innovation. What would technology look like today if Github didn’t exist? Would languages like Ruby and Python dominate? Would Julia or Clojure ever have gotten off the ground?
Traction is important, but far more important is reaching out to like-minded people who will ultimately appreciate what you are trying to build.