Will Doctor For Food: Exploring Medicare/Medicaid Open Payments Data

About a month ago, an alert came across my desk (well… metaphoric desk anyway): the Centers for Medicare & Medicaid Services had released updated data downloads for their Open Payments program. When I followed the link through to check it out the following warning greeted me:

Some datasets, particularly the general payments dataset included in the zip file containing identifying information, are extremely large and may be burdensome to download and/or cause computer performance issues. […] Be advised that the file size, once downloaded, may still be prohibitive if you are not using a robust data viewing application. Microsoft Excel has limitations on the number of records it can display, which this file exceeds.

Indeed, some of the CMS files are as much as a gigabyte of data each. And here I thought, “Hey, I have a company for this” (so yeah, if you want to poke through the CMS Open Payments data, all of it is on Exversion right here).

APIs are nice 🙂

That being said, one wonders exactly what you can do with Open Payments data. It’s natural to look at the words Medicare/Medicaid and assume these are all medical bills, but actually it’s a lot more interesting than that: [1]

This data lists consulting fees, research grants, travel reimbursements, and other gifts the health care industry – such as medical device manufacturers and pharmaceutical companies – provided to physicians and teaching hospitals.

Well now that sounds pretty nefarious. I mean come on, we all know that the money moved around through gifts and grants influences the type of treatments doctors recommend. So now the government is giving you an opportunity to look directly at that activity.

The fact that they took something really interesting and wrapped it up in the most uninteresting way possible is to be expected. It’s a government thing.

Identified -vs- Deidentified Datasets

If you check out our collection of CMS data the first thing you’ll notice is that each data type is split into two different sets: identified and deidentified. This is not, as I first assumed, the same data with identifying information removed (I admit that this wouldn’t actually make any sense to begin with, but in my defense I’ve seen the government do MUCH worse with their open data). Instead, the deidentified set is a collection of cases where some of the necessary data about who received what is missing or ambiguous.

Otherwise what the CMS released fits three categories:

  • General Payments: Payments or other transfers of value not made in connection with a research agreement or research protocol.
  • Research Payments: Payments or other transfers of value made in connection with a research agreement or research protocol.
  • Physician Ownership Information: Information about physicians who have an ownership or investment interest in an applicable manufacturer or GPO.

Looking At the Data: Who Gets the Most Research Dollars?

Essentially what CMS has released is just a dump of their database. Each file has what feels like twenty or more columns, most of which contain no information. The benefit of accessing this data through an API, as opposed to downloading the file and trying to work with that, is that we can segment the amount of data we’re looking at before committing any computer memory to the task.

The first thing we did was rearrange Research Payments to look at how much money each state received for the year 2013. Because this is a smaller dataset, we wrote a Python script to iterate through each page of data returned by the API, sorting and rearranging as needed. This is not recommended for very large datasets, as you will hit our API’s rate limit pretty quickly, but at this size it wasn’t an issue. We used Python to write nice clean JSON we could copy and paste into d3.js to create an interactive map (click through to see).
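For the curious, the page-by-page approach might look something like this in Python. The endpoint URL, the `_page`/`_limit` parameter names, the response shape, and the CMS column names below are assumptions for illustration – check the Exversion API documentation and the file headers for the real ones.

```python
# Sketch of paginated fetching plus per-state aggregation.
# API_URL is hypothetical; the field names mirror CMS-style column
# naming but should be verified against the actual dataset.
import json
from collections import defaultdict
from urllib.request import urlopen

API_URL = "https://exversion.com/api/v1/dataset/EXAMPLE"  # hypothetical


def totals_by_state(rows):
    """Sum payment amounts per recipient state, skipping rows with no state."""
    totals = defaultdict(float)
    for row in rows:
        state = row.get("Recipient_State")
        amount = float(row.get("Total_Amount_of_Payment_USDollars", 0) or 0)
        if state:
            totals[state] += amount
    return dict(totals)


def fetch_all_pages(base_url, page_size=1000):
    """Yield records page by page until the API returns an empty page.
    Iterating like this will hit the rate limit on very large datasets."""
    page = 1
    while True:
        with urlopen(f"{base_url}?_limit={page_size}&_page={page}") as resp:
            rows = json.load(resp).get("data", [])
        if not rows:
            break
        yield from rows
        page += 1
```

The aggregated dictionary can then be dumped with `json.dumps` and pasted straight into a d3.js map.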

interactive map

Along the way we discovered something funny. All the payment data was between the months of August and December. After a little research we discovered that this is a relatively new thing for CMS. The mandate to release this information was part of the Affordable Care Act. As 2013 is the first year, they could not collect a full year’s worth of data.

That means 2014’s files will be EVEN LARGER.

Will Doctor For Food

Anyway, we wanted to poke around this General Payments file; it seemed like the most interesting stuff would be there. But the identified version is over a GB… kind of unpalatable.

Luckily, with Exversion we can take a sample and play around with that instead. How about 50,000 records? Fetching 50,000 records and analyzing them took seconds. All we had to do was add the ‘_limit’ parameter to our request:
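In Python the request might look something like this. Only the `_limit` parameter comes from the text; the base URL and the shape of the JSON response are assumptions here.

```python
# Hedged sketch: pulling a 50,000-record sample in a single request
# via the API's `_limit` parameter.
import json
from urllib.request import urlopen


def sample_url(base_url, limit=50000):
    """Tack the `_limit` query parameter onto the dataset URL."""
    return f"{base_url}?_limit={limit}"


def fetch_sample(base_url, limit=50000):
    """Fetch at most `limit` records in one request (assumed response shape)."""
    with urlopen(sample_url(base_url, limit)) as resp:
        return json.load(resp).get("data", [])
```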


I bet if I told you Big Pharma was paying physicians and teaching hospitals in FOOD you wouldn’t believe me, but here’s the breakdown of that 50,000 record sample:

Type of Payment                                           Number of Payments    Total Amount
Food and Beverage                                                     40,302   $1,037,941.01
Gift                                                                      55      $80,035.93
Consulting Fee                                                           924   $1,917,576.16
Grant                                                                    382   $3,702,231.73
Travel and Lodging                                                     2,710     $862,276.81
Compensation for serving as faculty                                       90     $194,884.57
Royalty or License                                                       100   $6,913,971.38
Current or prospective ownership or investment interest                    4     $529,830.08
Entertainment                                                            108       $5,977.24
Compensation for other services                                        2,787   $3,256,475.94
Honoraria                                                                 27      $95,666.70
Education                                                              2,436     $456,251.12
Charitable Contribution                                                   14      $97,773.60
Space rental                                                              61      $72,862.55

The interesting thing here is that in our sample the vast majority of payments are tiny amounts related to wining and dining doctors and hospitals, but that does not add up to the most money spent. No, much more money is spent on royalties and grants, but to only a handful of institutions.
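A breakdown like the one above can be computed from the sample in a few lines. The field names mirror CMS’s column naming conventions but are assumptions here; verify them against the actual file headers.

```python
# Count payments and sum dollar amounts per payment type.
from collections import defaultdict


def breakdown_by_type(records):
    """Return {payment type: (number of payments, total amount)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for rec in records:
        kind = rec.get("Nature_of_Payment_or_Transfer_of_Value", "Unknown")
        counts[kind] += 1
        totals[kind] += float(rec.get("Total_Amount_of_Payment_USDollars", 0) or 0)
    return {k: (counts[k], round(totals[k], 2)) for k in counts}
```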

So there you go. Now you can play around with CMS’s Open Payments data without worrying about choking your computer. Can’t wait to see what the rest of the internet does with this.

Blaming Victims: How Stats Frame Our Perspective

We all know that you can manipulate the way statistics are presented to change their meaning, but you probably haven’t given much thought to the way their presentation affects how you see the world. I’m not talking about trusting a misleading statistic and believing in a policy or position that isn’t true. I’m talking about the way statistics influence how we define problems, and consequently what solutions we spend time and energy looking for.

Consider crime. Most crimes have two sides to them: those who commit the crime, and those who are victims of the crime. But the statistics we collect are inevitably obsessed with the victims. Google “Odds of being murdered” and hundreds of reports come up with authoritative sounding numbers. Here’s one from The Economist. Here’s another from Yale University.

Now try finding the odds that you will BECOME a murderer.

That’s not nearly as easy, which is odd when you consider it is virtually the same set of numbers that we have already collected. We just need to change what we’re counting.

And yet… Here’s an informal back-of-the-napkin calculation from Deadspin on your odds of knowing a murderer. Here’s something from Reuters about gun ownership increasing the risk of suicide or murder.

There are of course studies exploring the odds that a convicted murderer will kill again. There are odds of you getting AWAY with murder. There are stats on the male/female breakdown of roles when murders do happen. But there are virtually no statistics on your odds of one day killing someone.

On the surface this might seem like a trivial, almost obnoxiously pedantic issue. Why would anyone ever need to know their odds of committing a crime? You can control whether you commit a crime! Being a victim of a crime involves a certain amount of chance, so of course knowing your odds and how those odds are influenced by certain factors must be useful in protecting yourself.

But there’s one very big problem with this type of thinking: it automatically focuses us on solutions that prevent (or otherwise decrease the odds of) victims becoming victims, instead of preventing criminals from becoming criminals. From an individual point of view, looking for ways to minimize your risks makes a lot of sense. As an individual you can’t control anyone else and you might have little recourse after the fact. You focus on your decisions and behaviors because that is what you can actually do something about.

However, the same cannot be said for society as a whole. Society does have the ability to tell people what to do, and the power to enforce consequences when those prescriptions are violated. One would think society would also have a vested interest in minimizing the number of criminals. Criminals, generally speaking, are not fully productive, contributing members of society. From society’s point of view, criminals cost far more than victims.

And yet we devote only a fraction of the narrative to exploring the factors that lead to people becoming criminals. About the only time you will see any statistics on this topic is in discussions of low-income neighborhoods, and even then the stats are usually the odds of a person ending up in jail.

Not everyone in jail deserves to be there.

Sexual Assault and Statistics
You cannot possibly develop a solution for a problem if there is no discussion of the problem in the first place. If the conversation does not happen, people do not think about it. If people do not think about it, they do not recognize opportunities for solutions.

And in this case, by ignoring one half of the criminal-victim dynamic, we may also be ignoring the most effective solutions.

Consider rape. You probably realize already that a lot of the potential “solutions” to sexual assault end up asking potential victims to submit themselves to a ridiculous series of seemingly arbitrary dress codes, behavioral rules, and institutionalized paranoia. When those provisos fail it is assumed that the victim did not follow them carefully enough.

There are many situations that may lead to a sexual assault. Walking down the wrong street. Wearing something provocative. Getting drunk at a party. Dating a creepy guy.

Yet the same situations could just as easily NOT result in rape. For all our work collecting stats to protect victims, we actually don’t have much information as to why that is or how much this presumed “bad behavior” actually increases your risks. Unlike the lack of data on criminals, this isn’t a deliberate bias. Such data is really hard to collect.

Nevertheless, the consequence of framing the problem of sexual abuse in terms of the odds of becoming a victim is that the solutions this perspective provides are not all that effective at minimizing the rate of sexual abuse. After all, if wearing the right things, not hanging out with strange men, and not going out alone prevented abuse, Saudi Arabia would have the lowest rate of violence against women in the world (spoiler alert: it doesn’t).

Really all the victim bias does is enforce a state of terror in the perceived potential victims… who in fact might not be the most likely victims in the first place (for example, the majority of rape victims in the military in 2012 were men). So it’s not just a state of abject terror, it’s a state of pointless abject terror.

What would happen if instead of having stats like this beaten into our heads at every conceivable opportunity:

– 1 in 5 women will be raped
– 30% of them will be raped by people they know
– Every 2 minutes someone somewhere in America is sexually violated

…we were constantly reminded of stats like these (all made up):

– 1 in 5 men will commit rape
– You are ten times more likely to rape your partner than a stranger
– Every 2 minutes someone in America is sexually violating someone else

Even though the second set may seem unnecessarily antagonistic (almost Minority Report-esque in its assumptions), it has the unique effect of changing the focus of the problem. While there are millions of uncontrollable and unpredictable contributing factors that might lead up to a victim being raped, there’s really only one factor that leads to a person becoming a rapist. Rape is a choice – perhaps not always a MALICIOUS choice (e.g., statutory rape), but nevertheless a choice. No one accidentally rapes another person. No one commits a rape because they happen to wear the wrong thing. One could argue the occasional outlier case of rape-by-miscommunication, but one can’t deny that if the focus of the public conversation were “how do we keep people from becoming rapists?” rather than “how do we keep people from getting raped?”, would-be “unintentional” rapists would probably take more care to ensure consent is clearly articulated, thereby eliminating these cases.

In other words, reframing the statistics changes how we try to solve the problem to emphasize decisions we actually have control over. As a woman I cannot predict what hemline is long enough to avoid provoking lurking rapists, but the rapists themselves can easily choose not to rape.

It is tempting to assume that because we theoretically welcome free and open discussion, we are able to see all sides of an issue with very little mental exertion. But really we are programmed to see certain sides and rarely if ever look beyond them. Statistics in this sense provide a false sense of security, because it is not obvious how they can be framed to completely remove large parts of the situation from consideration.

Devil’s Advocate: Don’t Use Open Source

Due to the nature of our work, we frequently collaborate with non-technical organizations. This is the first part of what we’re calling our “Devil’s Advocate” series, intended to help non-technical people make better technical decisions by presenting the negative side of current tech trends.

Don’t use open source.

Don’t get me wrong, I love open source. I use open source projects and I contribute fixes back to them whenever I can. Open source is great.

But open source does not make solutions instantly better, more democratic, or more transparent. A lot of organizations – particularly nonprofits – insist upon open source because the philosophy of the open source movement matches their own ideals. Well, that’s nice, but technical decisions should be made based on technical arguments. Open source is the right solution when there’s an open source project that fits what you want to build.

Far too often, nontechnical managers ask the dev team to take an open source project and completely alter its functionality, without realizing that doing so is harder and takes more time than building something custom. The manager thinks using open source should make the project cheaper and speed up turnaround time; when that doesn’t happen, he gets frustrated with the work ethic of the programmers. “Why isn’t this done already?” he might ask. “The open source project works out of the box, all you need to do is make these tiny changes.”

Granted there are considerable advantages to going with an open source solution. It is easier to onboard new devs. You benefit from an established community of devs without having to pay them. You have a mature, fully functioning platform to test various ideas on.

But that last advantage is exactly why open source might be the wrong solution for you.

As code “matures” good developers reorganize it to remove duplicate functions and make it easier to maintain. From a nontechnical mindset it is difficult to understand what these periods of “refactoring” (translation: code clean up) are really doing or why they are necessary. Too often nontechnical managers see refactoring as a period where nothing is getting done because the devs were sloppy and have to go back and fix their mistakes! But all devs, even the best in the business, write code that eventually has to be refactored. It’s an important part of how software evolves.

Imagine we had two functions: send_email, which takes values from a form and generates a request to the server to send an email… and forward_email, which takes values from a form plus values from a database and generates a request to send an email. We wrote send_email before we knew that we would ever want forward_email, but when someone requested the ability to forward emails, that sounded like a good idea, so we quickly copied and pasted send_email and changed a few things before renaming it forward_email. Now suppose we realize we have made a mistake writing the code that generates the request. Or maybe we hadn’t made a mistake at all; maybe we upgraded to a new version of something and that required a change. Both send_email and forward_email need to be updated… but if we had created a third, more generic function (say generate_mail_request) and moved the code that builds the request out of send_email and forward_email, then we would only need to change generate_mail_request.

This is the concept of “abstraction”, building reusable bits of code that several parts of the application will use to do the same thing.
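Sketched in Python, the refactored version of the example above might look like this. The function names come from the text; the request format (a plain dictionary) is invented for illustration.

```python
# The refactoring described above: both features delegate the shared
# request-building logic to one generic function, so a fix or upgrade
# only ever touches generate_mail_request.

def generate_mail_request(sender, recipient, subject, body):
    """Shared logic: build the server request (format assumed here)."""
    return {
        "from": sender,
        "to": recipient,
        "subject": subject,
        "body": body,
    }


def send_email(form):
    """Compose a new message from form values alone."""
    return generate_mail_request(
        form["from"], form["to"], form["subject"], form["body"]
    )


def forward_email(form, stored_message):
    """Forward: form values plus the original message from the database."""
    return generate_mail_request(
        form["from"],
        form["to"],
        "Fwd: " + stored_message["subject"],
        stored_message["body"],
    )
```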

The process of minimizing and reusing code is essential to project growth. However, it can also make changing small things more difficult if the change conflicts with the way the program was designed to behave. This is why taking an open source piece of software and making “a few changes” may in fact take longer and be more difficult than writing the same software from scratch. If the open source project is mature enough, it has probably been refactored a few times, creating several layers of code that affect hundreds of features and functions.

Assessing the Ease of Your Desired Change

At the same time most mature open source projects do in fact assume you will want to change things around a bit. Their programmers work in pathways where certain changes can be made quite easily. The trick is understanding whether the customizations you want fit into that system or not.

As a general rule it is usually possible to change the way an open source project looks. If the project has a templating system, there should be an easy way to override the default templates with your own.

For mature and complex projects it is also usually possible to add new, independent features pretty easily. Since these additions leave the core behaviors alone, they only need a place to hook in so the open source application knows they’re there.

The big question mark is always how much of the existing behavior can be changed. That will depend on whether what produces the behavior in question is a link in the template, client-side JavaScript, or something in the server-side code – and, if it’s in the code, how deep it sits.

Open Source as Parts of a Whole

Now that you’ve heard all the consequences of picking an open source project over a custom solution, there’s good news: the best projects use both custom solutions AND open source. There’s an open source solution for practically every major feature you need, designed to slide easily into place in all kinds of different applications. So if you truly need something unique and there’s no open source platform that fits the bill, you can hire a developer to build something custom, but fill it with open source parts so that you’re not wasting time and money building things that someone else has already built. Pop in a login system, image uploading, commenting… all areas where there are several good open source options available.

And when you’ve finished your custom piece of software… you should definitely open source it 🙂

Public Speaking Hacks for Conference Season

Two years ago I took the stage during The Feast to explain why developers don’t work with open data. My presentation took the nontechnical audience through such concepts as brute force hacking and formatting inconsistencies, earned me a standing ovation and a stack of business cards a mile high, and ultimately set me down my current path.

Yesterday I returned to The Feast to give a 90 second pitch about a project I’m developing for Exversion. Again it got a great response – influential people sought me out rather than me having to stalk them through the crowds – and I found myself approached by a lot of would-be entrepreneurs asking for tips on giving an awesome presentation.

So I decided to write a quick guide with some of my best advice for pitches and longer form public speaking.

Find Three People

Having a crowd of hundreds of people focusing on you is really scary, so the first thing I do when I start my presentation is find three people in the crowd to make eye contact with. One on the left, one on the right, and one somewhere in the center and towards the back. Throughout the course of my talk– whether that talk is 60 seconds or 60 minutes– I keep coming back to these three people and making eye contact with them again. Usually I develop a little rhythm, jumping from left-person to center-person to right-person back to center-person, left-person… etc.

This technique has a couple of different benefits. First of all it brings the scale of the conversation down. You are no longer giving a speech in front of a massive group of people, you’re just having a conversation with three people. A casual conversation with three people is easy, right?

The second thing it does is give you a really easy way of judging audience response. Those three people will smile, they’ll nod, or they may start to fiddle with their cellphones … in which case you need to change something up on the fly.

The third thing it does is get the attention of everyone around those individual people. Think of how many times you’ve been freaked out by someone staring at you across a room, only to realize they were actually staring at someone behind you. That’s the benefit of crowds. You’re really looking at one person, but everyone around that person is going to feel like you are talking directly to them, which will make them pay more attention.

The last thing it helps with is establishing a soothing rhythm for your body language. Who do you think comes off as more confident and compelling: a speaker who stands in one place, swaying nervously and staring out into the crowd, or a speaker who moves around in a smooth, orderly pattern? This is true even when you’re using a podium: the speaker who turns her head to look at specific people seems much more natural than the speaker who occasionally looks up from her notes to stare into the void. Ultimately, when people feel like you are talking to them rather than at them, they are much more receptive to the content of what you are saying.

Memorize Phrases, Not Scripts

There are some people who can write out every single word beforehand, memorize it, then deliver a spectacular presentation, but most of us end up obsessing so much over remembering the exact order of each perfect sentence that we lose the emotion and energy we need to engage the audience. The power of perfect wording is a myth perpetuated by Hollywood screenwriters. It’s not the exact words that people respond to, it’s the enthusiasm and confidence of the speaker. After the presentation no one is going to remember exactly what you said; what they will remember is how they felt while you were speaking and whether you were likable on stage.

That being said, sometimes you do have key phrases that you want to make sure you use. They may be taglines or core concepts, or they may just be good transitions. The first and last lines of your presentation are good things to plan out: the first because it is the sentence in your presentation with the most potential power, the last because the absolute worst thing in the world is to get through your entire presentation and then end it with something like “uh… well, that’s all I have to say, thanks.” (We’ve all done it.)

So in planning out your presentation, figure out what the key phrases are and memorize those. Get used to saying them over and over and over again, so that you only need to say the first word and the rest of the sentence comes out automatically, without thought. For my Feast pitch yesterday my key phrase was “During a crisis the most expensive resource is time.” I practiced that phrase over and over and over again the day of the pitch, because what I was talking about was how poor data infrastructure can delay organizations like FEMA or the Red Cross literally for days. I needed people to understand that you can’t send people into the field without any clue where they’re going, what they might encounter, or what they’ll need to help. Every minute spent trying to get the information relief organizations need to deploy is a minute spent not helping. Blankets and bottled water are cheap; time is expensive.

Treat Your Points Like Legos

As much as possible strive to make your content modular. This is essential when you don’t have a slide deck to structure your talk, but it also applies to what you plan to say within the bounds of a single slide if you are working from a deck (because Lord knows you do not want to fill up a single slide with everything you plan on saying). Since you haven’t memorized your whole speech it’s natural that in the moment you may recall points in a different order than what you originally planned on. Do yourself a favor and try to avoid chaining thoughts together. The more Point B depends on the audience understanding Point A the more of a colossal screw up it is when you accidentally forget to make Point A and go immediately to Point B.

At the same time, keeping things modular means you have options when things just aren’t working. You should give yourself the ability to skip sections or examples if you find your talk going slower than you originally thought it would, or when the audience doesn’t seem to be responding. It’s important to try to predict the type of audience and the tone of the event during the planning stage, but even the best speakers make mistakes in this regard. The more brittle the structure of your presentation, the more likely it is to feel like a runaway train: you know it’s not working, the audience knows it’s not working, but you’ve given yourself no break points where you can add or remove content on the fly.

This may seem like a huge challenge, but it’s really not that hard. Just think through what you want to say in two- to three-sentence blocks. Get a general idea of what those sentences should sound like; don’t obsess over the exact wording. Think of it like a mini-essay: the first sentence states your point, the second sentence provides an example, and the third sentence offers a solution or conclusion, or restates your point (depending on the situation). Experiment with different wording each time you practice. Most people screw themselves over in their transitions. We don’t think about the meaning of the words in transitional phrases – we just say them automatically to fill the gaps between our thoughts – but when we’re trying to put together a smooth, eloquent speech we need the meaning of every word to align correctly.

As you practice stating your point, providing an example illustrating your point, then concluding you will realize there are certain transitions that you should absolutely avoid using because they paint you into a corner, leaving you with no smooth way of getting to the next point. There is no master list of phrases to avoid, it depends on what you have to say. By speaking naturally instead of from a script you will find these bad transitions as you practice and know how to avoid them.

Time Yourself Once, Then Forget About It

A little known secret of presentations: it’s very rare to pull a speaker off or interrupt them when they go over time. You always have a little clearance proportional to the total amount of speaking time, and depending on the moderator you may have even more wiggle room than that.

Interrupting a speaker is awkward; no one wants to do it, so organizers will usually put off doing it until they absolutely have to. You can get an extra five seconds out of a sixty second pitch, or an extra minute out of a four minute demo, so there is no earthly reason to obsess about time. Chill out and speak at a pace that is comfortable for you. Take the occasional dramatic pause. If there’s a clock displaying your time, ignore it. No one is going to pull you off stage with a giant hook, and you certainly don’t get a prize for finishing early.

You should time yourself while planning your talk so that you know approximately how much content you can fit into the allotted block. There’s a big difference between needing a little more time to finish your conclusion and passing your limit when you still have six more slides. If the organizers get the sense that you are wrapping things up they will let you finish. If they feel like you’re just about to launch into a whole new section of material they will cut you off. Experiment with and test how much content fits inside your block. Remember that saying something on stage is a different experience from saying it sitting down in front of a computer. There are always little seconds gnawed away by the logistics of being in front of others. Your slides may not advance immediately, or you may be stopped by audience laughter or applause. If you find yourself hitting the time limit exactly, trim a little fat and time yourself again to account for these issues. Once you can do your presentation under time once (maybe twice just to be sure), throw away your timer and just forget about that aspect of it.

Own Your Stage

I was not the sole entrepreneur invited to pitch in front of the Feast this year. I was one of eight given the opportunity to come up and talk about what I was doing. During prep the moderator explained her plan to keep things moving smoothly: we’d all go up on stage together, lined up in the order we were presenting in, so that there would be no awkward logistical issues moving from one presentation to the next.

And that’s how it went, one by one, down the line until we got to me where I did something truly shocking.

I stepped off the line, and walked towards the front of the stage to give my pitch.

When you’re presenting – no matter how much time you have, no matter how early or tractionless your project is, no matter how unimpressive your expertise, no matter how little physical space you actually have – YOU OWN THAT STAGE. It is yours and you need to act like it is yours until you give it up or someone drags you off!

By staying on the line, each speaker gave the audience a constant reminder that the person speaking was just one of many. It diminishes you; the audience ends up distracted by thoughts of the rest of the line… how many people are left? What might they talk about? Ooo, look how the guy who just went is still fidgeting! Stepping forward says: this is my space and my time to talk.

After we came off stage, all the entrepreneurs before me said, “I wish I had stepped forward too” (and actually the moderator told me, “I wish I had put you first so that everyone else would have stepped forward after seeing you do it!” haha). It wasn’t that they didn’t think of doing it; it’s that they wanted to follow the rules. They were afraid of doing something different and having their right to speak taken away. The difference between a good presentation and a great one is not language, it’s your ability to convince people that you command the stage. Which means you need to believe the space is yours and that you can do with it whatever you please.

Imagine how much more authority you will seem to have when it looks like you are so important that the organizers will let you do whatever you want in the time you’ve been given to present. Imagine how much more people pay attention when they think someone important is speaking.

Principles in Practice

Here’s how I prep for a presentation. First, I don’t really put much work into it until the week or day of (depending on the length and the number of slides I need to create, of course). I get excited about what I want to say, and that excitement is useful on stage, so I want it to be as fresh as possible. For a 45 minute presentation I’ll probably do the slides the weekend before (this tends to annoy conference organizers who want a clear proposal months in advance in order to weed out bad presentations, but I get most of my invites directly from people who have seen me speak, know how good I am already, and are willing to trust me). For a sixty second pitch I’ll do a script, usually an hour or two before. I never do the same presentation twice, even if I’m talking on the same topic.

I would rather have a complete meltdown on stage than come off sounding canned and rehearsed.

For that reason I abhor pitch practice and avoid the formalized practices sometimes run by organizers at all costs. I think nitpicking wording and style is the absolute worst way to coach someone into giving a good presentation. Organizers who insist on it are really only doing it for their own peace of mind– to soothe the urge to micromanage– not to help you. I am always terrible at pitch practice (ALWAYS), so if an organizer twists my arm about it, I’ll do my best to get through practice with a passing grade, completely disregard 90% of their feedback, and waste no more thought on it.

For pitches (i.e., talks under two minutes) I will start the prep process by writing out a script. When time is that tight every second counts, so it makes sense to invest the time and effort organizing your thoughts. The process of writing the first draft helps me identify time-munching digressions. When you’re excited about what you want to say you’ll be able to think up all sorts of compelling arguments; however, some of these arguments won’t be as easy to express in the allotted time. The script helps me find the thoughts that aren’t worth the time and energy they eat up, so that I can cross them out and focus on the points that can make the greatest impact in the smallest time frame.

Once I have that script I read it through a few times, then close my eyes and make sure I can flow smoothly from the key phrases that I’m trying to memorize to the sentences where I’ve decided the exact wording doesn’t really matter. If I’m having trouble remembering what comes next, I may modify a key phrase to better set up the next thought before I start memorizing it. This isn’t as complicated as it sounds; remember, you’re talking about something you know inside and out. All you need is a little hint.

That settled, I write a quick outline and throw out the script. My outline is not a proper traditional outline, but more like a list of key phrases and a summary of points in the order they should probably go. This is what my outline for the Feast pitch looked like:

– time is expensive
– types of data needed
– who needs the data?
– one infrastructure
– one infrastructure that hooks into anything

Then I just start running through the pitch until I feel comfortable with it. I feel like chasing a perfect run-through just jinxes you, so I try not to obsess about hesitations or minor errors. Every time I’ve blanked out on stage it’s because I’ve over-practiced. While it’s useful to have key phrases so committed to memory that you can finish them without thinking, when you’re presenting you really want to be aware of what it is you’re saying. It’s easier to recover from mistakes that way. If I feel nervous before going on (and I always do), I will look over my outline rather than trying to repeat the whole thing perfectly from memory to prove I can.

If I have more than two minutes I skip the script phase completely and use my slides to structure and guide the course of my talk. If I have any key phrases I want to use I will put them directly on the slide so that they are impossible to forget. I’ll run through the slides a few times, and if there are places where I feel the transition between one idea and the next is too fragile (relying too much on getting exactly the right sentence in place), I will add a slide in between those thoughts to better lead from one idea to the next. You can never have too many slides!

Ultimately the most important thing is that you feel as comfortable as possible when you’re up there. How you feel directly before you go on doesn’t matter because we’re all freaking out at that point anyway, but when the attention is on you, you should enjoy yourself.

Startup Divorce: Six Conversations To Have With Your Cofounder Before You Go Into Business

Shortly after the controversy over Tinder first hit the news I was sitting with a group of former and current founders having lunch. There was an interesting juxtaposition between the way we viewed our own experiences with cofounders and the way we viewed Whitney Wolfe’s situation. Each one of us condemned the stripping of Wolfe’s cofounder status and the diminishment of her role in the company’s success. However, when it came to our own departed cofounders each one also admitted to wanting to figure out a way to seize back equity and rewrite the company history.

Working with a cofounder is like getting married, and breaking up with a cofounder is exactly like getting divorced. It is not fun. VCs and startup gurus talk a lot about the importance of team and cofounder dating, but very few people give specific advice about what you should be looking for. How do you know when you have a solid match? How do you find the red flags before equity is divided up?

For the most part my experience with cofounders has been overwhelmingly positive, but even within that sample there were some red flags I ignored that came back to bite me in the ass later on.

Here are six conversations you should have with your cofounder before you tie the knot:

Ask them how they would want to exit in a perfect world

You’re thinking IPO, they’re thinking acquire. In the end you might not get either. It may not matter, but the two goals have completely different timetables and often require completely different strategies.

Ask them where they see themselves in five years (no really)

After a year, your startup will have burned through most of its buzz. Tech blogs won’t really care if you’re killing it; they want shiny new products to write about. After three years, if you’re not rolling around in a giant pile of cash everyone assumes you’re dead, broke, and humiliated by your failure. Yet the reality of startups is that they take years of long, hard work. A cofounder who sees himself as only being a cofounder for two or three years is going to be trouble well before the company’s first birthday.

Pick something that fits their role and ask them how they would set it up

This isn’t about quizzing your cofounder, it’s about seeing how well they can explain things to you and whether they are aware of the consequences of the choices they would recommend. When you start a startup there’s a lot that needs to be set up, and a lot of established companies are willing to throw free accounts and credits at you. If your cofounder leaves, the remaining cofounders need to understand the terms of those freebies: when do they run out, and how much will they cost when they do?

Give your cofounder time to research a full and complete answer to this, but press for details. If they don’t realize that you’ll have to file a change with the state to remove a registered agent (a common trick used by companies offering cheap incorporation services) that could create a problem later on.

Pick a topic specific to the role your cofounder will be playing. If it’s a technical role, ask about server setup. If a business role, ask about taxes. If a marketing role, ask about social media.

Ask them about their past jobs and startups

It’s not about qualifications, it’s about interests. Do they seem to jump from trend to trend, more concerned with striking it big than with following their passion? It’s easier to weather the stress of startup land when you love what you’re building. The number of sacrifices you’re willing to make increases and the burdens of startup life seem less demoralizing. When you’ve chosen your idea based on an assumption of what will be popular, popularity becomes the glue holding the team together.

Fair weather cofounders are the first to screw you over and the first to fight for more equity if the startup takes off.

Could they live on $1,000 a month? How close could they get to that level?

Would they move to Detroit? Live with their parents? Dumpster dive for food? It’s not necessarily important that they do any of these things, but understanding what cofounders consider essentials is a really good indicator of how long they’ll last. A cofounder who isn’t willing to drink tap water and has to live in certain trendy neighborhoods is going to be under way more pressure than one who considers those things expendable.

It’s also worth asking whether they’ve ever actually cut back to that level. We all say we’re going to budget better just before we splurge.

Do they have any skills they can leverage to bootstrap comfortably? What about passive streams of income?

You will in all likelihood get to the point where your personal financial situation is negatively affecting the company. It’s not necessarily because of something you did wrong. Opportunities cost money. Here’s what happened with Exversion: YC threw us a little cash to fly in for our interview, but nowhere near enough to cover the costs of the journey. TechCrunch insisted we fly out to SF for rehearsals in order to participate in Battlefield in Berlin. The Open Data Coalition offered us a free booth at their government conference, but we had to get to DC on our own.

I needed a few hundred dollars a month to balance things out, so I went back to teaching ESL (something I had done while traveling around Europe years before).

Having a skill or a resource that can provide injections of cash if needed is really useful. It can be the difference between having to leave the company or not.

If your cofounder says he can always freelance, press him on the specifics of contracts. The nice thing about freelance teaching is that classes have a definite start time and a definite end time. We agreed exactly how much time I would spend on that in advance and it would be extremely difficult to modify those plans.

However, working freelance for other startups or small businesses is different. They act like they own you, wanting you to be on call 24/7, changing project goals and specs on a whim, setting new last-minute deadlines. Add to that the additional work of finding clients, writing proposals, pitching them, negotiating terms, chasing after them for payment … freelance work can and will take over your life.

In the end do you really need a cofounder?

No cofounder is always a better choice than a bad cofounder. Bad cofounders can do a lot of damage, often when the startup is at its most vulnerable. At the same time, the existence of cofounders is used as an easy filter by VCs and incubators, and I suppose that makes a degree of sense. If you can’t convince one person of your product, how on Earth will you convince millions?

Analytics Battle: Hacker News -vs- Product Hunt

Product Hunt has emerged as tech’s new darling. All the power of Hacker News, but more curated: posts about the minutiae of programming languages, science, and math stripped out in favor of showcasing the best technical projects.

As with all platforms of scale, the larger Hacker News became, the more specific the romanticized expectations of it became. Burning idols are always en vogue. So one wonders how much of Product Hunt’s buzz is related to the effectiveness of the site itself, and how much Product Hunt benefits from disaffected hackers flocking to something shiny and bursting with seemingly endless goodies.

I’ve been on the top of Hacker News a couple of times, and this week Exversion hit the front page of Product Hunt. I’m now in the rather unusual position of being able to talk about the impact of both. Not a lot of people can say that.

So … how do they compare?

In terms of pure eyes on the page, the number one position on Hacker News will yield about 1,500 to 2,000 uniques. If that number seems a little low, it’s because the posts that tend to do best on Hacker News are blog posts, code repos and news articles. So you end up driving traffic to your site through another vehicle, inevitably losing some traffic along the way.

Just because a post linking your baby does well on Hacker News, doesn’t necessarily mean you’ll see a significant boost in traffic. My first number one was directly about my experiences getting over my biases against my cofounders and that translated into a lot of interest in what we were building. My second number one was a review of a startup conference. It yielded only about 300 uniques.

By contrast, Product Hunt put 700 unique eyeballs on the screen, much more than an off-topic blog post but much less than a more relevant top HN post. However, while Hacker News hits you with traffic all at once, Product Hunt visitors show up gradually over the course of several days.

Personally, I prefer a couple days of boosted traffic over a spike that only lasts a few hours, but I guess that’s a matter of opinion.


But thousands of visitors who only stay seconds aren’t really that valuable. Obviously you want visitors who browse around, sign up for an account, stay for a while. Here it was no contest: the average Product Hunter stayed on Exversion for 10 to 20 seconds. The average Hacker News reader stayed close to two minutes.

WINNER: Hacker News

And here’s where things get interesting. During the rush of new traffic produced by Hacker News, returning visitors stayed around 5%. With Product Hunt, returning visitors started off at 4% and climbed steadily to 10%. So visitors don’t stay as long initially, but they come back with greater frequency.

WINNER: Product Hunt

Viral Impact
Our first day on Product Hunt we had 300+ direct referrals and 200+ indirect ones. Indirect referrals come from Twitter bots and scrapers that harvest data from Product Hunt and redistribute it in order to boost their own content and activity levels. Get to the top of Product Hunt or Hacker News and these outlets will also pick up your link.

Sites scraping Product Hunt include Panda and The Scoop. Sites scraping Hacker News also include Panda and The Scoop, as well as a wide variety of Twitter accounts attempting to narrow down the firehose.

And here is where the nature of Hacker News gives Product Hunt a distinct advantage: getting to the top of Hacker News is a real challenge. Staying at the top is practically unhackable. Things drop off fast and are engineered to drop off even faster depending on who you displease. Different karma levels unlock more features, specifically flagging and downvoting. There’s a rumor that if a YC-alum downvotes your post it drops immediately to the second page no matter what.

Whatever the case, the fact of the matter is that during peak hours your post has only about twenty minutes to break onto the top page. Once there, it will start to drop as soon as the activity around it starts to wane. Realistically you’re looking at three~four hours of top-quality time. If you hit a nerve you might be able to stay at number one the entire day, but rarely if ever do posts stay on the front page more than twenty-four hours.

Product Hunt, by comparison, basically freezes the top 10 every day and displays links by day. So content picked up by syndicators is much more valuable because it sticks around longer, whereas Hacker News can be more of a flash in the pan.

Furthermore, if your startup rates high enough you get featured in Product Hunt’s mailing list, driving an extra spike of traffic later in the week.

WINNER: Product Hunt

You can really only be posted to Product Hunt once. I actually didn’t submit Exversion at all, so while I was happy someone else liked my work enough to throw it up, I was also disappointed that we went up before we had finished planned changes to the main page designed to more intuitively explain what we’re about. The redesign of the front page is a big project we’ve been devoting a lot of time and energy to, and a feature on Product Hunt would have been a lovely cherry on top of that accomplishment, but que sera~

Hacker News, on the other hand, offers infinite opportunities to put yourself in front of new visitors. Just keep blogging and keep submitting. Even posts that barely manage a single upvote tend to yield 40~60 new visitors before they crash. Totally worth it when you figure that submitting is free and you would write the blog post anyway.

WINNER: Hacker News

Conclusions: Breaking off HN niches makes sense
I love HN. However, I rarely if ever read it anymore. HN is a firehose of content that simply never ends and barely slows down. Even during off hours there is always new stuff to look at. You could literally spend your entire day reading HN. That’s the fatal flaw for them: the smartest hackers would rather spend all their time hacking, not reading Hacker News.

So I filter Hacker News by topic and break the week’s posts into digests (PS – if you’re into data you can subscribe to these digests here), but still there’s always FOMO. No algorithm is perfect.

Sites like Product Hunt make a lot of sense because they keep good stuff from falling through the cracks. However, even this most promising spin-off doesn’t come close to generating the traffic of the firehose. On the other hand, Product Hunt has succeeded where many other “Hacker News for X” attempts have failed because it refines the methodology of HN to fit its own purposes. Multiday boosts in traffic are much preferable to insane fleeting spikes. Visitors who come back a third and fourth time are gold.

All the numbers here reflect my own experiences. They are influenced by Exversion’s unique warts and complications. But I feel confident in saying that the primary difference between the two platforms is that Hacker News will send you traffic; Product Hunt will build traffic.

Building SEO Link Backs Through Github Pull Requests

Continuing our theme of being as evil as possible: you can now import data from Github into Exversion.

First let me say that I love Github. It is a hacker’s paradise and the perfect platform for what it is actually designed to do: share code. But lately Github has started pushing the idea that Github is an appropriate platform for everything from novels, to tutorials, to datasets. And while I’ve seen some truly brilliant ways of arranging a repository to do chapter-by-chapter instruction, the problem with releasing data on Github is that the same structures that make Github the most efficient solution for hosting code make it a frustrating and inefficient solution for hosting data.

Nevertheless, people do just dump data on Github, dump it there and hope that other people will actually use it. More and more people are dumping data others have already cleaned in exactly the same way on the same platform. If they could have found the data on Github in the first place, they could have spent their time building something else.

The question for us became how do we turn this into an advantage?

Time to come clean: I did not originally build the ability to import from Github for users. I built it as part of the admin dashboard.

I kept finding interesting data lost and ignored on Github. I kept downloading these files, creating data repos for them and uploading them to Exversion. It was satisfying… a bit like a treasure hunt, but it took up a lot of time. As a hacker when you do the same thing enough times, knowing that you are going to do it many more times in the future, you begin to think seriously about automating it.

The Great Scavenger Hunt: Finding Data on Github

Unless you’re linked to it directly or know the organization/person releasing it, finding data on Github is a pain in the ass. Github search does the reasonable thing and weights search results by repo activity; however, the overwhelming majority of their community is interested in code, not data. If you’re searching for something like Ebola data, the right repo pops up immediately. But if you want something like flight data, most of what comes up on Github is apps and pet projects where the word “flight” appears in the title or the description.

Github allows you to search by filetype, which is useful, but it assumes you want to query inside files. In other words, the query “flights extension:csv” will return csv files containing the word “flights” (in the contents or the file name), not repositories that match “flights” and contain csv files. You cannot run a filetype filter without a search query.
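As a rough illustration, the same query syntax can be aimed at Github’s code search API. This is a sketch of the URL construction only; a real request requires an authenticated token:

```python
# Sketch: the "flights extension:csv" query from above, built as a
# GitHub code search API URL instead of a web search.
from urllib.parse import urlencode

def code_search_url(term, extension):
    """Build a GitHub code-search API URL for files with a given extension."""
    return "https://api.github.com/search/code?" + urlencode(
        {"q": f"{term} extension:{extension}"}
    )

url = code_search_url("flights", "csv")
print(url)  # https://api.github.com/search/code?q=flights+extension%3Acsv
# An actual request would need authentication, e.g. with the requests library:
#   requests.get(url, headers={"Authorization": "token <YOUR_TOKEN>"})
```

The same content-matching caveat applies to the API as to the web interface: results are files whose contents or names match the term, not repositories.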

So once again, even if Github was the perfect solution for hosting data (which it is not) it can be very difficult to find the data that’s up there. We can’t harvest the data from Github if we can’t find it on Github. This was our first problem.

Luckily there is a service that can search Github and find all the csv files in public repositories. It can even filter by time period, so that every day we have a timely listing of new data to steal.

It’s called Google 🙂

Link Building Through Pull Request

Once I knew where the files I wanted were, importing was pretty easy. Github follows nice, orderly, predictable url patterns. I could download the raw csv file, reuse the repo’s metadata from Github’s API, and put the whole thing in the queue for Exversion with just a click of a button. But I wanted more. I wanted some way to reach out to the people struggling to use Github as a data sharing solution and let them know that we exist.
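Those predictable URL patterns are what make the download step so easy. A minimal sketch of the idea (the owner, repo, and file names below are made up for illustration):

```python
# Sketch: GitHub serves raw file contents from a fixed URL pattern, so
# turning a repository path into a downloadable csv link is pure string
# assembly, no API call needed for the fetch itself.
def raw_file_url(owner, repo, branch, path):
    """Return the raw-content URL for a file in a public GitHub repo."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"

# Hypothetical repo and file:
print(raw_file_url("someuser", "flight-data", "master", "flights.csv"))
# https://raw.githubusercontent.com/someuser/flight-data/master/flights.csv
```

The repo’s metadata (description, readme, owner) comes from Github’s API; the file itself comes straight off this URL.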

So once I confirmed the data had been imported correctly, I automated the process of forking the original repo, editing the README.md file to add a link back to the “mirror” on Exversion, and committing the change back to Github.
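The README edit itself is simple string manipulation done before the commit. A sketch, assuming a hypothetical mirror URL format (the fork and commit steps go through Github’s API and are omitted here):

```python
# Sketch: append a link back to the Exversion "mirror" to README text
# before committing the change to the fork. Idempotent, so re-running an
# import doesn't stack duplicate notices.
def add_mirror_link(readme_text, mirror_url):
    """Append a mirror notice to README text unless it is already there."""
    if mirror_url in readme_text:
        return readme_text  # already linked, nothing to do
    return (
        readme_text.rstrip("\n")
        + f"\n\nThis dataset is also available (with API access) at {mirror_url}\n"
    )

# Hypothetical mirror URL:
readme = "# Flight Data\n\nA csv of flights.\n"
updated = add_mirror_link(readme, "https://exversion.com/data/view/abc123")
```

The duplicate check matters because the same repo can surface in the search results more than once.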

Let me tell you, whoever designed Github’s API is a very smart guy, because there is one thing you cannot do through the API and that is create a pull request across forks. Working on this project made me realize how amazing it is that Github has not had the spam issues of other large, social websites. I suppose it would not be unfair to call what I was doing the world’s first pull request spam (but then again there are a lot of weird and wonderful things that go on via pull request), and I do feel a bit … dirty about it. Like, sure it’s incredibly evil, that doesn’t bother me … but it … mmm … inches a bit too close to an ethical line here.

At the same time, almost everyone I’ve contacted through pull request has been incredibly cool. Because I couldn’t automate this last step, I had to take the time to write a short message explaining that I’d found the data interesting enough to mirror on a site where it could be accessed via API. Most people putting data up on Github understand the convenience of having an API and were more than happy to accept my pull request.

It was a simple win-win outreach strategy: even if the user never checked out Exversion to see what we offered over Github, we still got a nice link back from a high-quality domain. And if the user accepted our pull request? That link back became even more valuable!

Github for Data

One of our most important goals in developing Exversion is trying to bring together a community of data enthusiasts. There is no gathering place for people who love data: we cross too many age groups, industries, and technical skill levels. But what sites like Github ultimately prove is how communication and collaboration within a community can incubate innovation. What would technology look like today if Github didn’t exist? Would languages like Ruby and Python dominate? Would Julia or Clojure have ever gotten off the ground?

Traction is important, but far more important is reaching out to like-minded people who will ultimately appreciate what you are trying to build.

Be As Evil As Possible: How We Got Our Competitors to Promote Us On Twitter Without Knowing It

I often joke that if Google’s motto is “Don’t Be Evil”, Exversion’s motto is “Be As Evil As Humanly Possible”. We are– to be frank– sneaky bitches who enjoy being sneaky bitches.

When your company is tiny, bootstrapped, and marked for death by just about everyone who matters in startup land, being evil becomes a righteous pursuit, a sort of noble savagery that inevitably charms people who remember what it’s like to fight for the survival of your dream.

For me the distinction between (forgive the term) endearing evil and the sort of startup evil that creates backlash and bad PR is power dynamics. Are you being evil to protect your power base? To hoard more privilege than you need at the expense of others? Or are you being evil because you’re on the losing side of an inherently unfair game? Are you struggling against silly rules and resource-controlling systems that people with the right friends or the right names on their degrees get to bypass?

Is it a shock to anyone that this is not a meritocracy? No? Have we finally dispelled this notion? Good.

Sure, if I’m any good at my job there will come a time when Exversion will be big enough and strong enough to get its way, and then it will fall on me to channel my natural sneakiness into other, less malevolent, pursuits … like … I don’t know, creating elaborate honeypots to annoy Chinese hackers, maybe.

Thankfully, today is not that day, so I can spend my time passively hijacking my competitors’ Twitter accounts to promote Exversion.

First let me clarify exactly what I mean by “hijack”: no accounts were hacked, no passwords were compromised, and no spam was sent to anyone via direct message, tweet, or other means. The victims of my shameless evil had no disruptions in their normal Twitter experience. The security and stability of Twitter as a service was not violated.

Here’s how we did it.

Like All Things Evil, It Started On Facebook

Somehow, at some point, an article from DailyTekk landed on my timeline. I had never read DailyTekk before and soon found myself pulled into a vortex of links. What are the 10 best box-of-the-month subscription services I’ve never heard of? Top seven apps for organizing my life? Go on…

Page after page of articles listing apps I had never heard of but immediately wanted to try. I just kept browsing, going deeper and deeper into the site until I had ten tabs open with different DailyTekk articles.

At some point I came across fledgling Twitter engagement service Flounder. This is when the fun started.

Flounder is a neat little idea: it tracks the activity of your employees’ (or team members’) personal Twitter accounts and looks for interactions. If a conversation between a team member and another Twitter user goes longer than three interactions, Flounder follows the user your team member is talking to from your startup’s account.

So in other words, chat with @bellmar a bit and @exversiondata will automatically follow you. If you don’t follow back, after a couple of days @exversiondata unfollows you and moves on.

Well, what could it hurt? I figure … I’m not especially active on Twitter anyway, let’s give it a try and see what happens.

While I was setting up an account for @exversiondata I realized that Flounder asks you for the Twitter accounts of your team members and never… actually… you know, CONFIRMS that they work for you in any way. No authentication. No message that you’ve been added to a group. Nothing. So I added Jacek to my group. True, he left the company months ago… but since he still lists Exversion in his Twitter bio, I figured he was fair game.

Then I got to thinking: there was nothing stopping me from putting my competitors on my Flounder team. That way I would be automatically reaching out to people passionate about data without having to do any work actually identifying them.

Oh but then I thought… wait. Why stop there? I can take this to a whole ‘nother level.

The Follow Back Is Dead, Long Live the Follow Back

When Twitter was new, following people was an excellent way to get them to follow back. Not so much anymore. But then, follow back isn’t the real goal here. Getting more followers would be nice, but what we really want is to expose new people to Exversion: get them curious, so that they click on the link in our Twitter profile and maybe sign up for an account. Alternatively, they might click through to our timeline and read a few posts on this blog; that would be good too.

The following acts as a first contact point. The user gets an email letting them know we exist and telling them a little bit about us. For some, that will be enough to pique their curiosity … I read somewhere that it takes at least three touches to trigger a conversion. Then I read somewhere else that it takes 8~10 touches. So whatever, let the bs marketing people split hairs over this; the point is one contact is not going to be enough. We need to follow up with these new friends.

Luckily, there’s a really easy way to do this: favorite their tweets. According to Adorer, 30% of people follow your account after you favorite one of their tweets. This was actually not the first time I had heard something like this.

So I wrote a quick script that picks the most recent accounts @exversiondata is following, pulls their tweets, looks for the word “data” and favorites one tweet.

Then I registered a cron job to have this script run every day at 10 am.
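The selection logic is simple; here is a sketch of it (the tweet structure and field names are simplified stand-ins, and the actual Twitter API calls to fetch timelines and send the favorite are omitted):

```python
# Sketch: from a followed user's recent tweets, pick the first one that
# mentions "data" as the candidate to favorite. The real script fetched
# timelines and sent the favorite through the Twitter API.
def pick_tweet_to_favorite(tweets):
    """Return the id of the first tweet mentioning 'data', or None."""
    for tweet in tweets:
        if "data" in tweet["text"].lower():
            return tweet["id"]
    return None

# The daily 10am run is just a cron entry along these lines:
#   0 10 * * * /usr/bin/python /path/to/favorite_data_tweets.py
```

Favoriting at most one tweet per user keeps the touch light, which matters later when we talk about staying under Twitter’s radar.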

As Flounder finds new people to follow, every day for two or three days those users get a notification that we have favorited one of their amazing, brilliantly written tweets. They don’t get spammed incessantly with unwanted contact. They just get pinged, casually and unaggressively a few times.

The Results

The first day using Flounder I noticed that @exversiondata had followed my friend Jay, who I often converse with over Twitter. So the damn thing works. Awesome.

The second thing I noticed was that @exversiondata was suddenly following the Twitter accounts of the guy who made Flounder and Flounder’s parent company– SNEAKY BITCHES!! Although I suppose given what I’m up to I don’t have the right to complain. LOL. Okay whatever, they can have a follow back if they want. I don’t mind.

By the end of the day I not only had a couple of follow backs, but a query from someone who wanted to use our service for a big scary data project, and a handful of new followers not at all connected to this scheme (thanks, Twitter algorithms!). The rate of follow backs stayed at 10%~15% throughout the whole experiment.

This was a big surprise. If you had asked me before this experiment started, I would have said simply following someone was not enough to entice them to follow you back. People are too cynical, too bombarded with corporate messaging; there are too many spam bots on Twitter. It will never work.

But it does work. Maybe not as well as it did when Twitter was new and people were more naive, but if you’re targeting the individual users most likely to be interested in what you’re doing, simply following people does still have an impact.

And that’s not even the best part of this experiment. The most dramatic change was in traffic from Twitter, which more than tripled in a week. This is the data on referrals from Twitter, starting a month ago … can you tell when this experiment started?

[Screenshot, 2014-08-24: Twitter referral traffic over the past month]

Even more impressive is how much longer the average visitor stayed on the site during this experiment:

[Screenshot, 2014-08-24: average visit duration during the experiment]

Slow Burn Hacking

The growth was gradual, a few people pinged every day, but it was effective. After all, too much and we would have risked attracting negative attention from Twitter itself, which sort of defeats the point, doesn’t it?

As much as we all wish for the excitement of instant explosions of growth, the slow burn is harder to track and more closely resembles legitimate user behavior. Because the bot doing the favoriting was only drawing from a pool of users identified through the networks of data companies, it was unlikely to favorite something unsuitable by accident. In general it would only favorite three or four tweets a day and follow about 15~20 people a day, hardly the kind of automated account management that tends to piss off Twitter. (Of course, depending on how well this post does on HackerNews, we could be suspended tomorrow!)

Other than occasionally picking up the odd random account (DominosCareers? Really?) the system was really quite good at identifying people we would otherwise want to follow anyway and helped us participate in some really interesting data conversations.

Another fun, incredibly evil, side effect of this kind of hack is that the people who are most likely to tweet a company directly tend to be customers with problems. So you are pinging people right when they are most likely to consider alternatives to whatever they are currently using.

On the other hand, it’s hard to tell how effective this strategy will be outside the data community. Despite the great fanfare over “Big Data”, there are many underserved areas: only a handful of startups are working on open data, version control, or normalizing and cleaning data, and only a few more than that are working on the collection of data itself. In other markets, where every five minutes there’s another startup launching, the same tactic may not have the same effect.

Still, what warms the evil core of my heart is the fact that the better your competition’s social media strategy is, the more effective they are at reaching out and developing conversations with potential customers, the easier it is for you to identify these people. Their outreach delivers the best potential evangelists for your project straight to your door, gift wrapped.

I Submit to Our Robot Overlords Because I’m Kind of Lazy

About a month ago I bought a robot vacuum. The damn thing is mesmerizing… the way it wanders around my apartment, getting into places I didn’t think it could, finding new ways to completely screw up my cowhide rug. The other day while cleaning it out I realized it had managed to suck up one of my battery pack adapters that must have fallen onto the floor.

It’s pretty amazing. I run it while I’m cleaning other things or when I hop in the shower and it cuts my list of chores down by at least one big task every week.

I’ve begun to apply a similar philosophy to Exversion, and last night you may have received the first of many regular robot-assembled mailings.

We have never made good use of our mailing list, everyone on the team always has other things they would rather do first. And up until recently it was always such a chore to figure out what we should say… new feature alerts? Sure, but how often do those happen? We blog about them too so do we just copy and paste from the blog post or do we link the blog post or do we have to write whole new content just for the mailing? AGH.

A few weeks ago we started something called Thoughts from the Exversion Team, because often we find ourselves wanting to write something about data, data science, or the multitude of data tools that exist, but not wanting to write a whole blog post about it. The blog hooks into Facebook and Twitter, and it seems bad form to spam everyone with three-paragraph posts all the time. There might be one new blog post a week, but sometimes two or three thoughts are posted in a day.

Once that got started though, with Data Requests, Data Thoughts and actual freaking data we had enough content and enough diversity of content to put together an interesting regular mailing.

In addition to pulling the best of our own content, our mailing list also pulls the best data news of the week from HackerNews. So each week you’ll receive the best stories concerning the wide wonderful world of data, curated by a crowd of cynical hackers with impossibly high standards. That alone should make it worthwhile. Who wants to be on HackerNews all the time?

Of course the robot doesn’t have full control of the mailing list … that would get gross really quick. No, the robot assembles the campaign and then forwards it to me for my approval. So there’s always a human pair of eyes making sure the content is quality.
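The assemble-then-approve flow above can be sketched in a few lines (everything here is hypothetical — the function name, the points threshold, the data shape — this is not the actual pipeline):

```python
def assemble_digest(own_posts, hn_stories, min_points=100, top_n=5):
    """Rank candidate HackerNews stories by points, keep the best few,
    and combine them with our own content into a draft campaign.
    Nothing gets sent from here: the draft is queued for a human
    to review and approve first."""
    picks = sorted(
        (s for s in hn_stories if s["points"] >= min_points),
        key=lambda s: s["points"],
        reverse=True,
    )[:top_n]
    return {
        "our_content": own_posts,
        "hn_stories": [s["title"] for s in picks],
        "status": "awaiting_approval",
    }

draft = assemble_digest(
    ["New dataset: CMS Open Payments"],
    [
        {"title": "Why your data is lying to you", "points": 312},
        {"title": "Yet another JS framework", "points": 45},
        {"title": "Postgres for data versioning", "points": 150},
    ],
)
print(draft["status"])  # awaiting_approval
```

In practice the story pool would be filtered to data-related topics before ranking; the key design choice is that “status” never advances past awaiting approval without a human.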

If you’re not on our mailing list, obviously now is the time to join.

PyGotham, Open Source and Bad Data

A month ago a good friend of mine asked me if I would consider giving a talk at this year’s PyGotham. I enjoy speaking at conferences, so I almost never turn down an opportunity. Python being one of my favorite languages, the only issue was what to talk about.

“Can I do it on debunking other people’s data science?” I asked.

A lot is written and lectured about the various libraries and modules for loading up data and doing all kinds of analysis, particularly in the Python community. Python is slowly becoming the language of data. True, there are other options specifically tooled for analysis (R, Julia, etc.), but Python combines a low barrier to entry (easy to learn, often already installed), powerful options, and a large, active community.

What we don’t talk a lot about is data quality. Most talks on data science start off with “so we take our data and…” with very little comment on getting the data and prepping the sample. And yet it’s these two stages that are the most complicated, require the most careful thought, and where mistakes are the most damaging. (Just today the NYT published a piece on this very problem)

There are also very few tools to help identify and prevent these issues.

My original plan was to walk people through these types of mistakes with real world examples. I was a little concerned about how little python would actually be included in this discussion, but I kept telling myself data science is a huge part of the python community and this is a huge issue in data science. These are not problems that only trip up students and the intellectually inferior. It was ridiculously easy for me to find examples from major publications like The New York Times and respected blogs like FiveThirtyEight. The consequences of bad data are everywhere.

Then, in the middle of the conference, a bolt of inspiration hit: we write code that tries to validate and unit test our processes all the time. The main problem with data science libraries is that they assume you understand all the caveats and proofs associated with each model. But why can’t we write a library that would analyze a dataset and give you feedback on structural issues, potential sampling errors, normalization issues, etc.?

So I went home, bought a six pack of beer, ordered a pizza, and changed half of my presentation. (1)

We Are Open Source

We have been working on isolating and open sourcing different components of Exversion’s technology for a while. It’s slow going, largely because with a small team I always have to choose between building something new and modifying the old to the point where it can click into place on someone else’s stack. Mind you, we have since tweaked our processes to build things that way the first time, but it took us a while to get those habits in place mentally.

Right now our major open source projects are as follows:

Junky: Dataset Profiling

The project I pitched at PyGotham is called Junky, and it got an amazing response, with four or five people volunteering to be contributors right on the spot. While I originally called it a dataset validator, for lack of a better term, Eric Schles smartly described it more as a dataset profiler… which I think better captures what it will be able to do. A profiler doesn’t tell you that your code is good, just how much time it takes to execute, how much memory it uses, etc. From there you may choose to clean things up, or you may not. The profiler’s job is to show you what’s going on that you might not see otherwise.

Likewise, Junky will not tell you if your data is good or bad, but it will measure things like how consistent your categories are, test for normal distributions, gauge how robust your sample population is, identify outliers, etc. These are things that anyone with formal training in data science learns to do before cranking out a linear regression model, but with a lot of people coming to the data science frontier from programming, sometimes the basic first steps are missed. Self-taught data scientists sometimes jump directly to executing commands in a stats library without knowing very much about the requirements of the models they’re using. Most common statistical analysis methods assume, for example, that your population data has something resembling a normal distribution (i.e., a bell curve), but it is ridiculously easy to collect a sample that isn’t normal in the statistical sense, and that is something you want to know before you do your analysis and draw conclusions.
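To make the idea concrete, here is a toy profiler-style check in plain Python (the function name and thresholds are hypothetical illustrations, not Junky’s actual code):

```python
import statistics

def profile_sample(values):
    """A profiler-style report: it doesn't say the data is good or bad,
    it just surfaces properties you should look at before reaching for
    a model that assumes normality."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    # Crude skewness estimate: a roughly normal sample has skew near 0.
    skew = sum((x - mean) ** 3 for x in values) / (len(values) * stdev ** 3)
    # Crude outlier cut: anything more than two standard deviations out.
    outliers = [x for x in values if abs(x - mean) > 2 * stdev]
    return {
        "n": len(values),
        "mean": mean,
        "stdev": stdev,
        "skew": skew,
        "outliers": outliers,
        "roughly_symmetric": abs(skew) < 0.5,
    }

report = profile_sample([10, 11, 9, 10, 12, 10, 11, 9, 10, 200])
print(report["outliers"])           # [200]
print(report["roughly_symmetric"])  # False
```

A real profiler would use proper statistical tests rather than these rough cuts, but even this much is enough to flag the kind of sample that would quietly wreck a regression.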

When you’re self-taught you learn best through mistakes. Most major mistakes in programming blow up in your face really quickly. But in data science critical errors can go unnoticed for long periods of time, crippling the passionate beginner’s ability to learn and improve.

Junky is intended as a way to help people who want to do data right, explore those problems on their own.

Data Cleaning Boilerplate

Another general data tool I’ve been working on is the Data Cleaning Boilerplate. The concept is simple: I write a lot of data cleaning scripts, and usually I end up assembling new scripts from hacked-together, copy-and-pasted bits of old ones. At some point I decided it would be really useful to start writing more generic functions for the things I do over and over again, so that I can copy and paste from the boilerplate and get things done faster.
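For illustration, here are a couple of the kind of generic helpers such a boilerplate might collect (these are hypothetical examples, not the repository’s actual functions):

```python
# Small generic cleaning functions of the sort you end up rewriting
# for every scraped CSV.

NULL_TOKENS = {"", "na", "n/a", "null", "none", "-"}

def clean_cell(value):
    """Collapse whitespace and map the usual null spellings to None."""
    text = " ".join(str(value).split())
    return None if text.lower() in NULL_TOKENS else text

def to_number(value):
    """Best-effort numeric parse: strips currency symbols and commas,
    returns None when the cell isn't numeric."""
    cell = clean_cell(value)
    if cell is None:
        return None
    try:
        return float(cell.replace(",", "").lstrip("$"))
    except ValueError:
        return None

print(clean_cell("  N/A "))    # None
print(to_number("$1,234.50"))  # 1234.5
```

The payoff is less in any one function than in having them all in one place, tested once, instead of re-hacked per project.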

Exversion Layer

Layer is a standalone version of Exversion’s version control system. It hooks into Postgres and uses a RESTful API to receive and return changes in data state.

Exversion Server

Exversion Server will be an open sourced version of Exversion’s data store technology. While there isn’t much to look at now, I’ve put this project back on the main dev schedule for the coming months, largely after conversations with Rufus Pollock of OKFN and the HDX team at the UN. CKAN, which HDX uses, has a data store module that appears to be set up exactly the same way we set up the prototype of Exversion we hacked together on the bus down to SXSW. We ditched that model for very good reasons when we came back to NY, so I think it will be worthwhile to release something that can be hooked into CKAN as an alternative. I’ll blog a more detailed explanation of my thinking when that is ready to go.

All of these projects welcome potential contributors, so if you’re interested please file an issue letting us know what you’d like to improve about them.


(1) I’ll put the final presentation online as soon as PyGotham releases the video. For now here are the slides.