The Bug that Took Down a Release and How I Wrote It

16 Apr 2008

We were disposing of some last minute bugs today and I was reminded of the time, a few projects ago, when I introduced a bug that screwed up a whole release.

Let’s say that I was writing a pizza ordering web site and, after many successful releases, the Big Wigs at PizzaCo wanted to introduce a new feature that would let customers customize their pizzas even more. In production the user could select style of pizza (thin, pan, stuffed), size, and toppings. There was even some cool logic in place to make sure the toppings were available before being offered. Now, in this brave new world of pizza, the customer would be given the chance to customize each topping. If you chose peppers, you would be given the choice of green, yellow, or red (through the magic of javascript). A choice of onions would prompt the question of red or yellow. Pepperoni? – regular or spicy?

So we wrote the feature, which was difficult, as we had to totally change how we interacted with the Topping Provider Service (TPS) in order to reserve these new toppings and check for availability. But then the Big Bosses had a thought: “What if this new extra bit of choice scares away the customer?” And their solution was to have “switch” to turn the extra customization on and off. As developers we pointed out that this would be costly as making the new system look like the old, but still work with the new under the covers (by selecting defaults for each topping), was not easy. And it would introduce extra complexity by having two different pages that do almost the same thing. But they wanted it, so we did it.

A few weeks later we found a bug with the topping selection page: If you take the last bit of pepperoni, but later come back to the toppings page to change your selections, the TPS service thinks all the pepperoni is gone (because you just reserved it) so you don’t see that option on the page. We fixed it by looking at the customer’s saved pizza and putting any toppings missing from the page back on the page. Except that we forgot to make the changes in the page that mimics the old behavior. Even worse, the way we fixed the new toppings page caused the pseudo-old page to blow up if you ever try to go back to it. And we found out that this happens right before a release.

And this was all my fault.

I worked on the feature to add the new customization, I also worked on the feature to hide the new feature, and (believe it or not) I worked on the bug fix. There was no one else on the team who was in a better position to remember that there were two pages and each need a change. Aside from my obvious point that I’m a terrible developer and you should never hire me, I think there’s an interesting lesson here about unused functionality.

We wrote the new page and then we used it. It did all sorts of sexy Ajax and was cool. And after a few weeks we forgot about the legacy page and things that might break it. There are many times when it will be tempting to bifurcate your code, but every time you do so you create a place for a bug. This is especially true if one path is rarely used. The new code will continue to evolve and the old will sit around and collect bugs. We did a full pass through regression testing, in addition to our normal tests and QA, but we didn’t find this bug until hours before the release. Which is better that finding it after, but still, if the release had gone out it would have had the new feature turned off (for “safety”) and the monster bug on. We were mere hours away from disaster.

I know it’s hard not to want options, but every option has its price.