PyGotham, Open Source and Bad Data

A month ago a good friend of mine asked me if I would consider giving a talk at this year’s PyGotham. I enjoy speaking at conferences, so I almost never turn down an opportunity. Python being one of my favorite languages, the only issue was what to talk about.

“Can I do it on debunking other people’s data science?” I asked.

A lot is written and lectured about the various libraries and modules used to load data and do all kinds of analysis, particularly in the Python community. Python is slowly becoming the language of data. True, there are other options specifically tooled for analysis (R, Julia, etc.), but Python combines a lower barrier to entry (easy to learn, often already installed), powerful options, and a large, active community.

What we don’t talk a lot about is data quality. Most talks on data science start off with “so we take our data and…” with very little comment on getting the data and prepping the sample. And yet it’s these two stages that are the most complicated, require the most careful thought, and where mistakes are the most damaging. (Just today the NYT published a piece on this very problem.)

There are also very few tools to help identify and prevent these issues.

My original plan was to walk people through these types of mistakes with real world examples. I was a little concerned about how little Python would actually be included in this discussion, but I kept telling myself that data science is a huge part of the Python community and this is a huge issue in data science. These are not problems that only trip up students and the intellectually inferior. It was ridiculously easy for me to find examples from major publications like The New York Times and respected blogs like FiveThirtyEight. The consequences of bad data are everywhere.

Then, in the middle of the conference, a bolt of inspiration hit: we write code that tries to validate and unit test our processes all the time. The main problem with data science libraries is that they assume you understand all the caveats and proofs associated with each model. So why can’t we write a library that analyzes a dataset and gives you feedback on structural issues, potential sampling errors, normalization issues, etc.?

So I went home, bought a six-pack of beer, ordered a pizza, and changed half of my presentation. (1)

We Are Open Source

We have been working on isolating and open sourcing different components of Exversion’s technology for a while. It’s slow going, largely because with a small team I always have to choose between building something new and reworking the old until it can click into place on someone else’s stack. Mind you, we have tweaked our processes to build things that way the first time, but it took us a while to get those habits in place mentally.

Right now our major open source projects are as follows:

Junky: Dataset Profiling

The project I pitched at PyGotham is called Junky. It got an amazing response, with four or five people volunteering to be contributors right on the spot. While I originally called it a dataset validator for lack of a better term, Eric Schles smartly described it more as a dataset profiler, which I think better captures what it will be able to do. A profiler doesn’t tell you that your code is good, just how much time it takes to execute, how much memory it uses, etc. From there you may choose to clean things up, or you may not. The profiler’s job is to show you what’s going on that you might not see otherwise.

Likewise, Junky will not tell you if your data is good or bad, but it will measure things like how consistent your categories are, whether your sample passes tests for normality, how robust your sample population is, which points look like outliers, etc. These are things that anyone with formal training in data science learns to do before cracking out a linear regression model, but with a lot of people coming into data science from programming, sometimes the basic first steps are missed. Self-taught data scientists sometimes jump directly to executing commands in a stats library without knowing very much about the requirements of the models they’re using. Most common statistical analysis methods assume, for example, that your population data has something resembling a normal distribution (i.e., a bell curve), but it is ridiculously easy to collect a sample that isn’t normal in the statistical sense, and that is something you want to know before you do your analysis and draw conclusions.
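To make that concrete, here is a minimal, pure-Python sketch of the kind of checks a profiler in this spirit might run. The function name, the report format, and the simple skewness-plus-IQR heuristics are all illustrative assumptions on my part, not Junky’s actual API:

```python
import statistics

def profile_sample(values):
    """Report simple distribution diagnostics for a numeric sample.

    A sketch of the kind of checks a dataset profiler might run --
    not Junky's real interface.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    n = len(values)
    # Skewness is ~0 for a symmetric (e.g. roughly normal) sample;
    # a large value is a cheap red flag before assuming normality.
    skew = (sum((x - mean) ** 3 for x in values) / n) / (stdev ** 3)
    # Flag outliers with the classic 1.5 * IQR rule.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    outliers = [x for x in values
                if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    return {
        "mean": mean,
        "stdev": stdev,
        "skewness": skew,
        "outliers": outliers,
        "roughly_symmetric": abs(skew) < 0.5,
    }

report = profile_sample([2, 3, 3, 4, 4, 4, 5, 5, 6, 40])
print(report["outliers"])           # the 40 should be flagged
print(report["roughly_symmetric"])  # heavy right tail, so not symmetric
```

A real profiler would go further (proper normality tests, category consistency, sample size checks), but even this much would catch the lone extreme value that quietly wrecks a regression.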

When you’re self-taught you learn best through mistakes. Most major mistakes in programming blow up in your face really quickly. But in data science critical errors can go unnoticed for long periods of time, crippling the passionate beginner’s ability to learn and improve.

Junky is intended as a way to help people who want to do data right explore those problems on their own.

Data Cleaning Boilerplate

Another general data tool I’ve been working on is the Data Cleaning Boilerplate. The concept is simple: I write a lot of data cleaning scripts, and usually I end up building new ones from hacked-together, copy-and-pasted bits of old scripts. At some point I decided it would be really useful to write more generic functions for the things I do over and over again, so that I can copy and paste from the boilerplate and get things done faster.
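As an example of the kind of generic helper such a boilerplate might collect, here is a small sketch that normalizes whitespace and common null tokens in CSV data. The names and null-token list are my own illustration, not necessarily what the repo contains:

```python
import csv
import io

# Common "this cell is empty" spellings seen in real-world CSVs.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "-"}

def clean_cell(value):
    """Strip whitespace and map common null tokens to None."""
    value = value.strip()
    return None if value.lower() in NULL_TOKENS else value

def clean_rows(fileobj):
    """Yield dicts of cleaned cells from a CSV file object,
    with header names stripped of stray whitespace too."""
    for row in csv.DictReader(fileobj):
        yield {key.strip(): clean_cell(val) for key, val in row.items()}

raw = io.StringIO("name , city\n  Ada , NULL\nGrace,  NYC \n")
rows = list(clean_rows(raw))
print(rows)
```

The point of keeping helpers this generic is that the next cleaning script starts from working code instead of a copy-paste archaeology dig.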

Exversion Layer

Layer is a stand-alone version of Exversion’s version control system. It hooks into Postgres and uses a RESTful API to receive and return changes in data state.
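Conceptually, a change in data state is a diff between two versions of a table. The post doesn’t show Layer’s actual wire format, so the payload shape below is purely hypothetical, just to illustrate what a row-level change set sent to such an API could look like:

```python
# Toy sketch of a row-level change set for a keyed table.
# The op/key/row payload shape is hypothetical, not Layer's format.

def diff_state(old, new):
    """Compute row-level changes between two keyed data states."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append({"op": "insert", "key": key, "row": row})
        elif old[key] != row:
            changes.append({"op": "update", "key": key, "row": row})
    for key in old:
        if key not in new:
            changes.append({"op": "delete", "key": key})
    return changes

old = {1: {"city": "NY", "pop": 8_400_000}}
new = {1: {"city": "NY", "pop": 8_500_000},
       2: {"city": "LA", "pop": 3_900_000}}
changes = diff_state(old, new)
print(changes)  # one update (key 1) and one insert (key 2)
```

Shipping only the change set, rather than the whole table, is what makes versioning data at any real scale practical.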

Exversion Server

Exversion Server will be an open sourced version of Exversion’s data store technology. While there isn’t much to look at now, I’ve put this project back on the main dev schedule for the coming months, largely after conversations with Rufus Pollock of OKFN and the HDX team at the UN. CKAN, which HDX uses, has a data store module that appears to be set up exactly the same way we set up the prototype of Exversion that we hacked together on the bus down to SXSW. We ditched that model for very good reasons when we came back up to NY, so I think it will be worthwhile to release something that can be hooked into CKAN as an alternative. I’ll blog a more detailed explanation of my thinking when that is ready to go.

All of these projects welcome potential contributors, so if you’re interested please file an issue letting us know what you’d like to improve about them.


(1) I’ll put the final presentation online as soon as PyGotham releases the video. For now here are the slides.
