Differential Diagnosis in Production: The Outcome

(At last the conclusion to Differential Diagnosis in Production. Could you figure out what was wrong with the server? Here’s the answer…)

As panic began to set in, I remembered something useful. Some commands on Symfony’s console are pickier than others. They will not execute if there’s an error, even if that error is something small in a file that has nothing to do with what the command should be doing. In the past I found this strictness super annoying, but now I realized it could be useful. If the server had stopped writing errors to the logs, then perhaps I could force it to write the errors to standard output instead. It was a long shot, but I didn’t seem to have any other options.

And it worked.

I chose php app/console doctrine:schema:update --force, which checks that the database matches the ORM schema outlined in the code. Since there were no updates, the script should have run and reported back that everything was up-to-date. Instead, out came a big red error block.

Son of a– We were out of disk space.

“That’s impossible,” Jacek said. “We have like 80GBs.”

And yet pulling up the disk usage confirmed it: the web root was using over 75GB of space. Suddenly everything made sense. The problem had nothing to do with Symfony’s cache, but clearing the cache had opened up enough space to allow Exversion to operate. Once the server didn’t even have enough space to write to the logs, deleting the cache stopped fixing the problem.
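
If you want to run the same kind of check yourself, here is a minimal Python sketch. It is not what we ran on the server, and the function names are made up for illustration. It reports how much of a volume is still free and which immediate subdirectories hold the most bytes, which is roughly the information df and du give you.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def tree_size(root):
    """Sum the apparent size in bytes of every regular file under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so a link to a large file is not counted twice.
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file disappeared between listing and stat; ignore it.
                pass
    return total


def biggest_subdirs(root, top=5):
    """Return up to `top` (size_in_bytes, name) pairs for the largest
    immediate subdirectories of `root`, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((tree_size(entry.path), entry.name))
    sizes.sort(reverse=True)
    return sizes[:top]
```

Pointing biggest_subdirs at the web root would have put the downloads folder at the top of the list long before the logs went quiet.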

So one question remained: Symfony doesn’t use that much space. Our entire development environment, all components of it, is only about 2GB. What was eating up all that space?

In a word: downloads.

How Do People Want Their Data?
Bulk data downloads were never part of our ultimate vision, but we had to provide them in order to give people the ability to branch and remap the data they wanted. It was an inefficient solution, but until we finished the features that allowed people to edit their data through Exversion directly (and we’re almost there) we had to offer it.

What we didn’t anticipate is how popular downloading data would be. And why would we? All of our competitors focus on serving files. They are older, more established, larger companies with greater visibility. Everyone who comes to Exversion has had experience with them.

So we assumed that what people wanted from us were features that our competition didn’t offer. After all, why download a file through us when you can go directly to the source and download it there?

Part of the answer is that people clean data found on those other sites and upload it to Exversion because they can’t upload it to our competitors’ sites. Another reason is that we index all those other sites for unified search. There are some who would say no one needs data search because Google does such a good job indexing everything … and those people have clearly never actually tried to find data through Google. It’s hard. It’s really freaking hard. We have three upcoming events where we’ve been given the rank and privileges of sponsors just for helping the organizers find data.

But back to our technical troubles: Exversion doesn’t host files. Downloads are generated dynamically based on the current state of the data. To keep the load on the database down, we don’t delete the generated files right away but clear them out on a regular basis. All that was working fine. We just didn’t anticipate how popular downloading would be. To put it simply: we weren’t clearing out downloads often enough.

Earthquakes, Drones and Gun Violence
Once we realized what the problem was, our first assumption was that this was all about PLUTO. PLUTO (for those who don’t know) is a massive collection of real estate data split into two parts: metadata and shape files. We released PLUTO, including a GeoJSON version of those shape files, about a month ago. Obviously downloads were eating up space faster because people were downloading these much larger files.


This is what people have been downloading

There were over 4,000 downloads between the regular clearing of files and the crash. And hundreds more have been downloaded since Labor Day. The most popular of the MapPluto datasets was Queens, followed by Staten Island of all things. Manhattan is the least popular.

Then there are the earthquake datasets, which are overwhelmingly popular both on Exversion and on Data.gov where they originate. I have no idea what people use them for.

Do We Change What We’re Doing?
A month ago I wrote a somewhat controversial post about cognitive biases in tech. Some interpreted it as a criticism of Paul Graham and YC, but if it was, it was even more so a criticism of myself. The post spent much more time detailing false assumptions that I had made … and apparently continue to make.

Exversion had a system for reporting the number of downloads. There was a bug in it, which was why we didn’t notice the escalating level of activity until it took us down. I actually had noticed the bug months before all this, but I did not make fixing it a priority because it confirmed what I already wanted to believe about what we were doing. In other words, fixing the download counter seemed less important than handling other things because I assumed that users weren’t as interested in downloading data as they were in other features … but as long as the download counter was broken we couldn’t collect the data that proved me wrong. That obvious fact never occurred to me … which is ironic really.

So, once discovered, the problem was easy to fix: I cleared out the downloads, rewrote the process of generating downloads to be more resource efficient, fixed the download counter (adding some extra logging of that activity to boot) and tweaked the cron job that clears downloads to clean up that folder more often. For the time being we’ve disabled downloading of our much larger PLUTO files, because we have to figure out how this changes things for us.
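
The cleanup side of that fix amounts to something like the sketch below. It is Python rather than our actual code, and the function name, the folder argument and the age cutoff are all assumptions for illustration. The idea is simply to delete any generated file that has not been modified within some window, and to run it from cron often enough that the folder can never fill the disk.

```python
import os
import time


def clear_stale_downloads(folder, max_age_seconds, now=None):
    """Delete regular files in `folder` whose last modification is more than
    `max_age_seconds` old. Returns the list of paths that were removed.

    `now` is only there so the cutoff can be pinned in tests; by default the
    current time is used.
    """
    now = time.time() if now is None else now

    # Take a snapshot of the directory first so we are not deleting
    # entries while still iterating over the live listing.
    with os.scandir(folder) as it:
        entries = list(it)

    removed = []
    for entry in entries:
        # Only touch plain files; leave subdirectories and symlinks alone.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.path)
    return removed
```

Returning the removed paths makes it easy to log how much each run cleared, which is the kind of number we should have been watching all along.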

I built the download system based on the idea that it was a temporary solution that would ultimately be phased out. It was neither designed nor intended for the type of usage it’s seeing. I’m not sure what its performance is like on those large repositories, to be honest.

And since we are getting a lot of activity in this area, the question remains: do we need to rethink our philosophy about the future of data? Should we make downloads more of a priority and redirect resources to building better infrastructure there? Plenty in Silicon Valley would say “Yes, absolutely”, but if the world were run by venture capitalists we would have more apps for Miley Cyrus to humiliate herself on and no roads. Building infrastructure is an unglamorous thing. Its value can’t be gauged with public opinion, because its value doesn’t appear until someone builds something on top of it.

Lately we’ve been having more and more meetings with NGOs looking for a way to share and collaborate on data. The feedback we’ve gotten from them suggests we’re on the right track. There’s a need that no one is serving that we are uniquely positioned to fill … and once implemented the solution could be truly game changing.

So we’ll see.

