When software goes wrong.

This is what happens

You walk into the office, or you get a telephone call in the middle of the day. If things are really bad, you'll get the phone call when you're at home and completely disconnected from work. The message goes: "XYZ has broken and we don't know why." The immediate reaction is to log onto the system and click around to see if it actually is broken. In the best scenario you're only getting confirmation that things are down the pan; at worst you're going to aggravate the system even more by generating extra load with your clicks. Either way: bad. At this point there are usually several people squawking and running around like headless chickens doing the exact same thing. Suddenly the panic circus is in town and everybody is joining in the fun. At this point you should just stop, or, to be precise, your boss should tell you to stop.

Your initial reaction

Your first step should be to verify the problem. If you have good support staff (or whoever answered the phone was on the ball) then you will have an extensive list telling you where the problem occurred, who it occurred to, and what they were doing. It is crucial that you receive this information in writing. If it isn't written down, the facts will get skewed and mangled into, at worst, a problem that doesn't actually exist, and you'll end up spending time chasing ghosts while leaving the real problem alone. Nightmare. With all the information you have, your first task is to verify that there is indeed a problem, and that you can replicate it. Hopefully you'll be able to do this, otherwise you'll be in the very awkward situation of thinking that your user is seeing and hearing things and should up their medication. They, on the other hand, will feel very frustrated because nobody believes them. Tip: if your system logs every failure, you should at least be able to get a hint that the problem actually happened.
Now that you can replicate the problem, you'll be scanning through the logs to see what's been going on. With a bit of luck you've created the appropriate error messages and the logging is set up properly. The result should be that when you do A, B happens. If this is not the case, then you're probably sitting on an unstable system that is either badly designed or so complex that you're chasing ghosts. Not good again.
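A minimal sketch of that log scan in shell. The log file, its path, and its format are made up for the demo - substitute your own - but the point stands: pull out only the error lines clustered around the time the user reported the failure, rather than eyeballing the whole log.

```shell
# Fake application log standing in for your real one (assumed path/format).
cat > /tmp/app.log <<'EOF'
2024-05-01 10:14:02 INFO  request ok
2024-05-01 10:15:10 ERROR db connection refused
2024-05-01 10:15:11 ERROR retry failed
2024-05-01 10:16:00 INFO  request ok
EOF

# "When you do A, B happens": show only the errors around the reported
# failure time (10:15 in this made-up example).
grep 'ERROR' /tmp/app.log | grep '10:15'
```

If the reported time window comes back empty, that's useful information too: either the logging is inadequate or the report is wrong.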
Depending on your system you might want to try restarting a few services, or even the whole system, to see if something has just "borked due to old age" - it happens. I've seen a load-balanced system that had one machine restart at even hours and another at odd hours just to keep the system "stable" - not kidding!
If restarting services/systems doesn't fix the problem, go to another machine and see if you can replicate the problem there. If you can replicate it on two machines or systems then you've almost certainly discovered a bug of some sort. This is a good starting point.

Meanwhile on the telephone

Whilst you're administering first aid and trying to locate the exact cause, your boss has told whoever is in charge of customer psychiatry to deal with the panicking customers. The customers have obviously bought your system to achieve a business task, and if your system is dumping cores they can't conduct their business. At this point it's crucial that any discovery by the developers is immediately communicated to whoever deals with the customers. The one who deals with the customers needs to inform them of a few crucial points.

  • If the problem has been confirmed
  • If the problem has been located
  • That you're doing everything in your power to fix it - if you can, tell them who or how many are working on it
  • What the procedure is going to be once you've fixed the problem, i.e. apply the patch, shut the system down, etc.
  • When it's going to be fixed, and if you can't promise a deadline for the fix, then
  • Tell them you're going to get back to them with a new status within X hours or Y minutes, but most importantly DO IT. Don't bl--dy forget this one. Set an alarm on your computer/PDA/watch. Don't forget to ring back.
  • If possible, offer workarounds or alternative services that allow the customer to either do other tasks during the downtime or use another system. Don't handicap the customer. (Obviously planning for contingency is a "Good Thing"(tm))
  • When the problem is fixed, take it upon yourself to inform the customers immediately. They're really worth that effort.

When you've pinpointed the problem

Now that you know roughly where the problem is, you need to figure out exactly what's causing it. It's vitally important that you can pinpoint the problem; you need to know the exact cause before fixing it.
Here are a few pointers on how to find it.

  • Eliminate everything non-vital to the system. Turn things off - just wrap them in comments and run again.
  • Only make one change at a time between checks. It's crucial that you don't go changing a load of parameters at once while wearing a cowboy hat.
  • Stop the execution at certain points. In PHP I'd insert a line something like: print("Line " . __LINE__ . ", File: " . __FILE__ . "<br />"); die(); (this prints the line and file the current piece of code is on, then stops execution). But be careful about stopping execution if you have things that need to be allowed to run to completion. Then again, if you're doing this you've probably got a broken system and you're testing on a development box, so you don't care that much anyway.
  • Make sure nobody else is tinkering with the system; you want it to maintain its current state for as long as possible, and only change when you want it to.

If the problem "disappears" when you're eliminating certain parts, you need to go back, systematically, and enable them one by one to see where the gremlins live. Once you've found the offending gremlin house, you proceed to locate the offending gremlin's room, and so on. Being systematic is, again, crucial.
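The one-at-a-time re-enabling can be sketched as a loop. Everything here is invented for the demo: the feature names are placeholders, and is_broken simply pretends that "cache" is the culprit - in real life it would exercise the actual system and check whether the failure has reappeared.

```shell
# Stand-in for "does the failure show up with this set of features enabled?"
# Assumption for the demo: the cache feature is the gremlin.
is_broken() {
    case "$1" in
        *cache*) return 0 ;;  # failure reappears once cache is back on
        *)       return 1 ;;
    esac
}

enabled=""
culprit=""
for feature in auth cache reports mailer; do
    enabled="$enabled $feature"     # switch ONE more thing back on...
    if is_broken "$enabled"; then   # ...and re-test after each change
        culprit="$feature"
        echo "gremlin found in: $feature"
        break
    fi
done
```

The discipline is in the loop body: exactly one change per iteration, and a test straight after it. Skip the test once and you no longer know which change woke the gremlins.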
I can admit to having thought I'd pinpointed a problem, but because I wasn't extra-special-super-careful with the way I re-enabled functions, I ended up spending hours eliminating a "problem" in the wrong place. And this has happened more than once - luckily without wasting too much time.
At some point you'll end up with the exact offending line of code or configuration instruction. This is what you've been looking for. Nice! From here on it's merely a matter of devising a cunning plan to either code around the problem or - the best solution - fix it properly.


You're now soooo relieved that you feel like the king of the world and think everything is shiny and happy. But wait. Slow down a bit. You still need to deploy your fix. If you're a big corporate software house with too many paying customers, you'll write an entry in your knowledgebase, advise that the fix will be out in the next software release, and go back to playing Doom III. If you care about your customers, you'll figure out a way to deploy the fix to one server/system and have a few people help you test it. Once you're all happy that the problem has actually been fixed, you deploy the exact same fix to the rest of the broken systems. Note that you should be super careful that you don't deploy something else. This is another thing that occurs all the time: the problem gets properly fixed on the dev system and the staging system, but for some weirdo reason the exact same fix isn't deployed to all the rest of the systems. Bad juju!
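One cheap way to catch the "exact same fix didn't land everywhere" trap is to compare checksums of the patched file across systems. This sketch fakes it with local files standing in for the staging box and two live boxes (in practice you'd fetch each copy over ssh); the file names are made up for the demo.

```shell
# Stand-ins for the patched file on each system (assumed names for the demo).
echo 'fixed code v2' > /tmp/staging_app.php
echo 'fixed code v2' > /tmp/live1_app.php
echo 'old broken v1' > /tmp/live2_app.php   # the box everyone forgot

# Checksum of the known-good, tested fix.
ref=$(sha256sum /tmp/staging_app.php | cut -d' ' -f1)

# Compare every deployed copy against it.
for host_file in /tmp/live1_app.php /tmp/live2_app.php; do
    sum=$(sha256sum "$host_file" | cut -d' ' -f1)
    if [ "$sum" = "$ref" ]; then
        echo "$host_file: OK"
    else
        echo "$host_file: MISMATCH - fix not deployed here"
    fi
done
```

A mismatch here is exactly the "bad juju" above: the fix exists, was tested, and still isn't running where the customers are.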
If you've made it this far and have actually got all systems up and running, you can now waltz over to whoever has the joyful task of informing customers and tell them that Thunderbirds Are Go! This is where you crack open a can of Jolt Cola and relax.

But in real life

The scenario is often quite different when the fire is for real. Usually it goes something like this:

  • Boss man is "shittin' bricks", huffing and puffing, pacing back and forth demanding to know what's going on.
  • Every man and his dog loads up the same system just to see it keel over once more.
  • Nobody tells the customers what's going on.
  • Developers and Support staff are trying to figure out WTF?? whilst trying to answer the boss man.
  • Someone is trying to blame a previous bugfix/upgrade/employee/dog/cat to come up with SOMETHING THAT MAKES EVEN REMOTE SENSE (aaargh)
  • Finally someone rings the customers up and starts fabricating lies and other statistics, but the customer isn't really buying it, only waiting for enough time to pass so that they can shout.
  • Morale takes a nosedive and everyone is cursing about how shitty everything is.
  • Someone in a quiet corner is surfing Jobserve....

Funny, isn't it, how we just can't act rationally and deal with one problem at a time. Instead we devise plans to save our own backs and do our utmost to come up with excuses, both to ourselves and to anyone within earshot and telephone-shot. We truly are no more than animals when the pressure rises.

Of course

Naturally the best way to solve this is to never get yourself into such a situation in the first place. That's a completely different topic, but here are a few points that might help you along.

  • Self-discipline is crucial. Don't bow to stupid, quick-and-dirty decisions, and be aware that any bad decision you make will come back and bite you. Hard.
  • When it comes to time estimates, make sure they stay estimates, and do sit down and consider what needs to be done before sticking a number onto a frog and letting it leap out of your mouth.
  • When things go wrong; Learn from it. Mistakes and oversights happen, but patterns form quickly and they're bad.
  • Unit testing might solve a whole lot of problems in the long run
  • Have a dedicated testing environment and testing persons. As a developer you should never test your own mess.
  • Motivated people are efficient. Motivation doesn't have to cost money; a few hours off one afternoon can do wonders.
  • When it comes to system design; Aim for stability first, bells and whistles second. It's very difficult to stop a system falling over while it's very easy to make shiny buttons in Photoshop/The Gimp.
  • Assumption is the mother of all f==k-ups: Get used to adding a lot of error checking, and never assume a variable contains a certain value. It will go wrong.
  • Apply common sense (scary thought isn't it!)

I hope this rant has given you a chuckle or two, as we can all use a bit of perking up once in a while, especially when your system is lying on its back with the silicone implants pointing towards the sky. Remember, shit doesn't just happen; someone usually puts a lot of energy into making it all fail. Have fun out there, and be careful with those function calls - they can lead to one helluva mess.