Differential Diagnosis in Production


This week we celebrated Labor Day by giving our web server some time off 🙂 Why should the relaxing qualities of holidays be reserved for conscious life forms only? One day the machines will rule us all, might as well start making nice now.

But seriously, the process of diagnosing and debugging a critical failure is not something often talked about, yet every technologist will have to deal with it eventually. Even if every piece of code you push is absolutely perfect, modern day web programming has too many moving parts … you can’t be an expert on all of them.

And when your service is down, the clock is ticking, and the problem appears to be buried deep in something beyond your expertise, everything depends on your resourcefulness. So, yes, this weekend Exversion suffered a critical failure that took the site offline for about twenty minutes on Sunday and about two hours on Monday, but the process of tracking and eliminating the problem turned out to be pretty interesting.

Early Symptoms
It started with an email from Lee Byron. Mixed in with some general feedback was an offhand comment about difficulty logging in. The login seemed to be “broken” but no other details were provided.

Neither Jacek or I were able to reproduce the issue, nor were any of our other users complaining. I emailed Lee back anyway, asking for more information about the error he got (if any). At this point there seemed like two likely explanations:

  • Some strange cross browser issue.
  • He hadn’t activated his account by verifying his email address

When I didn’t hear back I assumed the issue had been “user error” and took it off my agenda.

Locked Out
A few days later on Saturday morning, I was giving a demo to a perspective client on his computer. I logged in, only it didn’t take. No errors, everything appeared to work, but the “Login/Sign Up” buttons were still on the top nav and the members only features weren’t displaying.

“Do you have any kind of privacy, anonymous surfing software installed on this machine?” I asked.

“No, not that I know of.”

Exversion is built on Symfony2, using Friends of Symfony’s UserBundle (with some custom tweaks to the invitation recipe to suit our purposes) for User Management. By default this stores sessions in Symfony’s cache.

So the first thing I did when I encountered this problem myself was SSH into the server and clear the cache.

I could then login without any problems, and the rest of the meeting went really well. But obviously this wasn’t a real solution. Why wasn’t Symfony creating user sessions correctly? Was the problem with Symfony itself or FOSUserBundle? I started googling and digging through github issue reports, trying to find a clue … but there was nothing.

Original Diagnosis
There also wasn’t anything useful in either Symfony’s logs or apache’s so the best explanation I could come up with on virtually no information was that old sessions weren’t being cleared away and were filling up the cache. The solution seemed to be to move sessions out of the cache itself, something I had been meaning to do for a while anyway because I push code to Exversion every day and every time I clear the cache it logs everyone out.

But I wasn’t familiar with the procedure for moving the sessions folder and I wanted to read up on it and run some tests before going through with it. Like most, I prefer to do major changes to the infrastructure somewhere around 3am~4am so that if something goes wrong (God forbid) the impact to users is minimum.

And unfortunately I had to be up early on Sunday to head out to my parents. I hadn’t visited my family in about a good two months … pretty bad seeing as they live basically an hour away. So I figured I would keep an eye on the server and leave the permanent fix until I was back in the office on Tuesday.

Offline (part one)
I was in the middle of having a mojito with my mother and looking at pictures of their cruise to Alaska when I got the email from Jacek:

Not to ruin your weekend, but can’t login and can’t register on the site :/

login redirects to splash page, and register gives me a 503. :/

Effin’ hell … already? “Okay,” I texted back. “It’s a problem with the cache, I’ll fix it. Don’t worry.”

I SSHed into the server and cleared the cache. Then I reloaded Exversion.com expecting to be able to login and go back to my drink. Only what I got instead was a 500 error … and not one of our pretty custom ones either.

The site was dead.

HOLY CRAP, Fix It Now!
I spent the better part of the next twenty minutes texting back and forth with Jacek. He had limited wifi at his apartment and couldn’t get into the server but was able to confirm that the server itself was still running. The worst part of dealing with these sort of things is that Exversion is at present a two person team. I write the code and design most of the infrastructure, Jacek maintains the servers and does the UX. When something unexpected has gone wrong the best asset is another perspective or (barring that) the ability to step away from the problem and clear your mind. But we needed to get back online ASAP, and with only two of us fresh perspectives and breaks were in short supply.

I went through the usual rounds of magic tricks: I restarted apache, I tried clearing the cache again. Symfony’s logs now had some useful errors in them announcing that Twig could not write to the cache folder.

99.9999% of the time this means the permissions haven’t been set correctly. The cache folder is owned by the root user, but apache needs to be able to write to it.

And this made absolutely no sense because the permissions had not been changed and Exversion had been functioning correctly with those settings for months. How could a folder change its own permissions magically? It couldn’t.

But I didn’t have any better ideas, so I tried resetting the permissions. No luck. Examined the subfolders to make sure the command had executed recursively. No dice. Finally I just got so  frustrated I deleted the entire cache folder, recreated it, reset its permissions and reloaded the page.

Bingo. We were back up.

At this point let me stop and quickly explain something for people not familiar with Symfony. Symfony has a neat set of command line functions that help with a variety of administrative tasks. The cache is separated out into multiple folders for each environment (although most applications will only have two: development and production). The command to clear the cache requires you to specify which environment to clear, and it won’t touch the other folders in the cache. If you specify no environment, it will clear development by default.

In addition, Symfony rebuilds its cache when it detects something missing. So deleting the entire cache directory and recreating it wasn’t as extreme as it might sound. Symfony would rebuild the folder structure it needed automatically.

Offline (part two)
I ran a few tests, confirmed the site was back online and back to fully functioning, then I logged on to Exversion’s official Twitter account and announced that we were working through a problem with Symfony’s cache and to let us know about any errors.

I went back to google. It no longer seemed likely that this was a problem with user sessions. Could it be some bug in an update to Symfony core? Could it be an apache issue? I was having a hard time identifying anything useful and yet I knew that it was only a matter of time before the problem struck again.

It happened about 3:30 the next afternoon

I confirmed the issue myself and tried deleting the cache again, hoping another rebuild would work the same magic.

No luck.

The whole site was down again. Now not even the front page would load. To make matters worse the logs were not updating either. The server log was cut off midline, which I had never seen before, while Symfony’s log wasn’t updating with any errors.

I was flying blind.

About a month ago, just before we came out of alpha, Swift Mailer stopped sending emails. Since we were in private alpha, it wasn’t a major tragedy, but– like now– there weren’t any errors being thrown. I spent the whole day working through every component of our email system, until I figured out it was a bad update. I tend not to expect large, widely used open source projects to fail … but as soon as I updated the package the problem was solved.

So now with no other options, I ran a composer update and prayed.

Nothing.

Exversion was broken and I had no idea what was wrong or how to fix it.

(To Be Continued. Part Two with the solution will be posted on Sunday. For now, take your best guess. What troubles this ailing server?)

4 thoughts on “Differential Diagnosis in Production

  1. Pingback: Differential Diagnosis in Production : The Outcome | Happy Endpoints

Leave a comment