Differential Diagnosis in Production: The Outcome


(At last the conclusion to Differential Diagnosis in Production. Could you figure out what was wrong with the server? Here’s the answer…)

As panic began to set in, I remembered something useful. Some commands on Symfony’s console are pickier than others: they will not execute if there’s an error anywhere in the project, even a small one in a file that has nothing to do with what the command is supposed to be doing. In the past I had found this strictness super annoying, but now I realized it could be useful. If the server had stopped writing errors to the logs, then perhaps I could force it to write the errors to standard output instead. It was a long shot, but I didn’t seem to have any other options.

And it worked.

I chose php app/console doctrine:schema:update --force, which checks that the database matches the ORM schema defined in the code. Since there were no schema changes pending, the command should have run and reported back that everything was up to date. Instead, out came a big red error block.
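If you want the same trick in your toolkit, it looks something like this (the --dump-sql variant is my suggestion for when you don’t want to risk touching the schema, not what I actually ran):

    # run a "picky" console command by hand over SSH; if the kernel
    # can't boot, the exception prints straight to the terminal
    # instead of the dead log files
    php app/console doctrine:schema:update --force

    # safer diagnostic variant: prints the SQL it would run instead
    # of executing it, but boots the same kernel and surfaces the
    # same errors
    php app/console doctrine:schema:update --dump-sql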

Son of a– We were out of disk space.

“That’s impossible,” Jacek said. “We have like 80GBs.”

And yet pulling up the disk usage confirmed it: the web root was using over 75GB of space. Suddenly everything made sense. The problem had nothing to do with Symfony’s cache, but clearing the cache had been opening up just enough space to let Exversion operate. And once the server no longer had enough space even to write to the logs, deleting the cache stopped fixing the problem.
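If you ever need to run the same post-mortem, the stock Linux tools are all it takes. A quick sketch, substituting your own web root for /var/www:

    # confirm the disk really is full
    df -h

    # find which directory under the web root is eating the space
    du -h --max-depth=1 /var/www | sort -hr | head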

So one question remained: Symfony doesn’t use that much space. Our entire development environment, all components of it, is only about 2GB. What was eating up all that space?

In a word: downloads.

How Do People Want Their Data?
Bulk data downloads were never part of our ultimate vision, but we had to provide them in order to give people the ability to branch and remap the data they wanted. It was an inefficient solution, but until we finished the features that let people edit their data through Exversion directly (and we’re almost there), it was the only option.

What we didn’t anticipate was how popular downloading data would be. And why would we? All of our competitors focus on serving files. They are older, more established, larger companies with greater visibility. Everyone who comes to Exversion has had experience with them.

So we assumed that what people wanted from us were features that our competition didn’t offer. After all, why download a file through us when you can go directly to the source and download it there?

Part of the answer is that people clean data they’ve found on those other sites and upload it to Exversion because they can’t upload it to our competitors. Another reason is that we index all those other sites for unified search. There are some who would say no one needs data search because Google does such a good job indexing everything … and those people have clearly never actually tried to find data through Google. It’s hard. It’s really freaking hard. We have three upcoming events where we’ve been given the rank and privileges of sponsors just for helping the organizers find data.

But back to our technical troubles: Exversion doesn’t host files. Downloads are generated dynamically based on the current state of the data. To keep the load on the database down, we don’t delete the generated files right away; we clear them out on a regular basis. All of that was working fine. We just didn’t anticipate how popular downloading would be. To put it simply: we weren’t clearing out downloads often enough.

Earthquakes, Drones and Gun Violence
Once we realized what the problem was, our first assumption was that this was all about PLUTO. PLUTO (for those who don’t know) is a massive collection of real estate data split into two parts: metadata and shape files. We released PLUTO, including a GeoJSON version of those shape files, about a month ago. Obviously downloads were eating up space faster because people were pulling these much larger files.

…Nope.

This is what people have been downloading:

There were over 4,000 downloads between the last regular clearing of files and the crash, and hundreds more since Labor Day. The most popular of the MapPLUTO datasets was Queens, followed by Staten Island of all things. Manhattan was the least popular.

Then there are the earthquake datasets, which are overwhelmingly popular both on Exversion and on Data.gov where they originate. I have no idea what people use them for.

Do We Change What We’re Doing?
A month ago I wrote a somewhat controversial post about cognitive biases in tech. Some interpreted it as a criticism of Paul Graham and YC, but if it was, it was even more a criticism of myself. The post spent much more time detailing false assumptions that I had made … and apparently continue to make.

Exversion had a system for reporting the number of downloads. There was a bug in it, which is why we didn’t notice the escalating level of activity until it took us down. I had actually noticed the bug months before all this, but I didn’t make fixing it a priority because the broken counter confirmed what I already wanted to believe about what we were doing. In other words, fixing the download counter seemed less important than handling other things because I assumed that users weren’t as interested in downloading data as they were in other features … but as long as the download counter was broken, we couldn’t collect the data that would prove me wrong. That obvious fact never occurred to me … which is ironic, really.

So, once discovered, the problem was easy to fix: I cleared out the downloads, rewrote the process of generating downloads to be more resource-efficient, fixed the download counter (adding some extra logging of that activity to boot), and tweaked the cron job that clears downloads to clean that folder out more often. For the time being we’ve disabled downloading of our much larger PLUTO files, because we have to figure out how this changes things for us.
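For the curious, the cron tweak amounts to nothing more exotic than running the cleanup more aggressively. A minimal sketch, where the path and the one-hour window are stand-ins for our actual settings:

    # crontab: once an hour, delete generated download files
    # older than sixty minutes
    0 * * * * find /var/www/downloads -type f -mmin +60 -delete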

I built the download system on the idea that it was a temporary solution that would ultimately be phased out. It was neither designed nor intended for the kind of usage it’s seeing. I’m not sure what its performance is like on those large repositories, to be honest.

And since we are getting a lot of activity in this area, the question remains: do we need to rethink our philosophy about the future of data? Should we make downloads more of a priority and redirect resources to building better infrastructure there? Plenty in Silicon Valley would say “Yes, absolutely,” but if the world were run by venture capitalists we would have more apps for Miley Cyrus to humiliate herself on and no roads. Building infrastructure is an unglamorous thing. Its value can’t be gauged by public opinion, because its value doesn’t appear until someone builds something on top of it.

Lately we’ve been having more and more meetings with NGOs looking for a way to share and collaborate on data. The feedback we’ve gotten from them suggests we’re on the right track. There’s a need that no one is serving, one we are uniquely positioned to fill … and once implemented, the solution could be truly game-changing.

So we’ll see.

Differential Diagnosis in Production


This week we celebrated Labor Day by giving our web server some time off 🙂 Why should the relaxing qualities of holidays be reserved for conscious life forms only? One day the machines will rule us all; might as well start making nice now.

But seriously, the process of diagnosing and debugging a critical failure is not something often talked about, yet every technologist will have to deal with it eventually. Even if every piece of code you push is absolutely perfect, modern-day web programming has too many moving parts … you can’t be an expert on all of them.

And when your service is down, the clock is ticking, and the problem appears to be buried deep in something beyond your expertise, everything depends on your resourcefulness. So, yes, this weekend Exversion suffered a critical failure that took the site offline for about twenty minutes on Sunday and about two hours on Monday, but the process of tracking and eliminating the problem turned out to be pretty interesting.

Early Symptoms
It started with an email from Lee Byron. Mixed in with some general feedback was an offhand comment about difficulty logging in. The login seemed to be “broken” but no other details were provided.

Neither Jacek nor I was able to reproduce the issue, nor were any of our other users complaining. I emailed Lee back anyway, asking for more information about the error he got (if any). At that point there seemed to be two likely explanations:

  • Some strange cross-browser issue.
  • He hadn’t activated his account by verifying his email address.

When I didn’t hear back I assumed the issue had been “user error” and took it off my agenda.

Locked Out
A few days later, on Saturday morning, I was giving a demo to a prospective client on his computer. I logged in, only it didn’t take. No errors, everything appeared to work, but the “Login/Sign Up” buttons were still in the top nav and the members-only features weren’t displaying.

“Do you have any kind of privacy, anonymous surfing software installed on this machine?” I asked.

“No, not that I know of.”

Exversion is built on Symfony2, using Friends of Symfony’s UserBundle (with some custom tweaks to the invitation recipe to suit our purposes) for user management. By default this stores sessions in Symfony’s cache.

So the first thing I did when I encountered this problem myself was SSH into the server and clear the cache.

I could then log in without any problems, and the rest of the meeting went really well. But obviously this wasn’t a real solution. Why wasn’t Symfony creating user sessions correctly? Was the problem with Symfony itself or FOSUserBundle? I started googling and digging through GitHub issue reports, trying to find a clue … but there was nothing.

Original Diagnosis
There also wasn’t anything useful in either Symfony’s logs or Apache’s, so the best explanation I could come up with on virtually no information was that old sessions weren’t being cleared away and were filling up the cache. The solution seemed to be to move sessions out of the cache entirely, something I had been meaning to do for a while anyway, because I push code to Exversion every day and every time I clear the cache it logs everyone out.

But I wasn’t familiar with the procedure for moving the sessions folder, and I wanted to read up on it and run some tests before going through with it. Like most, I prefer to make major changes to the infrastructure somewhere around 3am~4am, so that if something goes wrong (God forbid) the impact on users is minimal.
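For reference, the change I kept putting off is a small one in Symfony2: pointing the session handler at a directory outside the cache, so that clearing the cache no longer logs everyone out. A sketch of the config, where the save_path shown is one common choice rather than necessarily what we settled on:

    # app/config/config.yml
    framework:
        session:
            handler_id: session.handler.native_file
            save_path: "%kernel.root_dir%/sessions"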

And unfortunately I had to be up early on Sunday to head out to my parents’. I hadn’t visited my family in a good two months … pretty bad, seeing as they live basically an hour away. So I figured I would keep an eye on the server and leave the permanent fix until I was back in the office on Tuesday.

Offline (part one)
I was in the middle of having a mojito with my mother and looking at pictures of their cruise to Alaska when I got the email from Jacek:

Not to ruin your weekend, but can’t login and can’t register on the site :/

login redirects to splash page, and register gives me a 503. :/

Effin’ hell … already? “Okay,” I texted back. “It’s a problem with the cache, I’ll fix it. Don’t worry.”

I SSHed into the server and cleared the cache. Then I reloaded Exversion.com expecting to be able to login and go back to my drink. Only what I got instead was a 500 error … and not one of our pretty custom ones either.

The site was dead.

HOLY CRAP, Fix It Now!
I spent the better part of the next twenty minutes texting back and forth with Jacek. He had limited wifi at his apartment and couldn’t get into the server, but he was able to confirm that the server itself was still running. The worst part of dealing with these sorts of things is that Exversion is, at present, a two-person team. I write the code and design most of the infrastructure; Jacek maintains the servers and does the UX. When something unexpected goes wrong, the best asset is another perspective or (barring that) the ability to step away from the problem and clear your mind. But we needed to get back online ASAP, and with only two of us, fresh perspectives and breaks were in short supply.

I went through the usual rounds of magic tricks: I restarted Apache, I tried clearing the cache again. Symfony’s logs now had some useful errors in them, announcing that Twig could not write to the cache folder.

99.9999% of the time this means the permissions haven’t been set correctly. The cache folder is owned by the root user, but Apache needs to be able to write to it.

And this made absolutely no sense because the permissions had not been changed and Exversion had been functioning correctly with those settings for months. How could a folder change its own permissions magically? It couldn’t.

But I didn’t have any better ideas, so I tried resetting the permissions. No luck. I examined the subfolders to make sure the command had executed recursively. No dice. Finally I got so frustrated that I just deleted the entire cache folder, recreated it, reset its permissions, and reloaded the page.
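Roughly, the sequence looked like this (the www-data group is a typical Debian/Ubuntu default and an assumption on my part; adjust for your distro):

    # reset ownership and group-write on the cache -- no effect this time
    sudo chown -R root:www-data app/cache
    sudo chmod -R g+w app/cache

    # the nuclear option: wipe the cache wholesale and let Symfony
    # rebuild the folder structure it needs
    sudo rm -rf app/cache
    mkdir app/cache
    sudo chown root:www-data app/cache
    sudo chmod g+w app/cache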

Bingo. We were back up.

At this point let me stop and quickly explain something for people not familiar with Symfony. Symfony has a neat set of command line functions that help with a variety of administrative tasks. The cache is separated out into multiple folders for each environment (although most applications will only have two: development and production). The command to clear the cache requires you to specify which environment to clear, and it won’t touch the other folders in the cache. If you specify no environment, it will clear development by default.

In addition, Symfony rebuilds its cache when it detects something missing. So deleting the entire cache directory and recreating it wasn’t as extreme as it might sound. Symfony would rebuild the folder structure it needed automatically.
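Concretely, that means clearing each environment is a separate command:

    # clears app/cache/dev -- the default when no environment is given
    php app/console cache:clear

    # clears app/cache/prod -- the one that matters on a live server
    php app/console cache:clear --env=prod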

Offline (part two)
I ran a few tests, confirmed the site was back online and fully functioning, then logged on to Exversion’s official Twitter account, announced that we were working through a problem with Symfony’s cache, and asked users to let us know about any errors.

I went back to Google. It no longer seemed likely that this was a problem with user sessions. Could it be some bug in an update to Symfony core? Could it be an Apache issue? I was having a hard time identifying anything useful, and yet I knew it was only a matter of time before the problem struck again.

It happened at about 3:30 the next afternoon.

I confirmed the issue myself and tried deleting the cache again, hoping another rebuild would work the same magic.

No luck.

The whole site was down again. Now not even the front page would load. To make matters worse, the logs were not updating either. The server log was cut off mid-line, which I had never seen before, while Symfony’s log wasn’t recording any errors at all.

I was flying blind.

About a month ago, just before we came out of alpha, Swift Mailer stopped sending emails. Since we were in private alpha it wasn’t a major tragedy, but, like now, there weren’t any errors being thrown. I spent the whole day working through every component of our email system until I figured out it was a bad update. I tend not to expect large, widely used open source projects to fail … but as soon as I updated the package, the problem was solved.

So now, with no other options, I ran a composer update and prayed.

Nothing.

Exversion was broken and I had no idea what was wrong or how to fix it.

(To Be Continued. Part Two with the solution will be posted on Sunday. For now, take your best guess. What troubles this ailing server?)