A bug introduced 6 months ago brought Google's Cloud Load Balancer to its knees

A week after Google suffered a TITSUP*, the gang at Mountain View has published a lengthy post-mortem on what went wrong. It was a known bug in the configuration pipeline. Things went south on Tuesday 16 November after a fault in Google's cloud infrastructure made it all too clear just how many online outfits rely on it. Users …

  1. Anonymous Coward


    > It was declared a high-priority incident, but heck – the bug had been there for months without anything exploding, so a decision was taken not to opt for a same-day emergency patch, but instead roll out the fix in a steadier manner.

    > What could possibly go wrong?

    Ah, a "Heisenbug": not actually a bug until observed by an engineer!

    1. John Robson Silver badge

      Re: Heisenbug

      Not a bad decision - I would expect them to make at least one of these calls every week (probably more) that we never hear about.

      1. Version 1.0 Silver badge

        Re: Heisenbug

        It's Google, so it was probably coded as a new "feature" and nobody checked it to make sure that this didn't happen ... this is nothing new ...

        FEATURE n. 1. A surprising property of a program. Occasionally documented. To call a property a feature sometimes means the author of the program did not consider the particular case, and the program makes an unexpected, although not strictly speaking an incorrect response. See BUG. "That's not a bug, that's a feature!" A bug can be changed to a feature by documenting it. 2. A well-known and beloved property; a facility. Sometimes features are planned, but are called crocks by others. - A DECUS cookie from 1993.

        1. Richard Pennington 1

          Re: Heisenbug

          It sounds like somebody needs to remember the difference between a feature and a creature.

    2. swm

      Re: Heisenbug

      When I wrote the executive for the Dartmouth Time Sharing System I noticed that a bug that was there from day zero would be triggered maybe a year later. Once this happened, it was triggered several times a day.

      This was not because of a hacker but it just seemed the way things happened.

      For malicious attacks the statistics were different.

  2. TRT Silver badge

    Reminds me of one of my favourite films...

    The China Syndrome.

  3. _Charles_


    “Million-to-one chances...crop up nine times out of ten.”

    -- Equal Rites

  4. Doctor Syntax Silver badge

    "in very rare cases"

    AKA "inevitable".

  5. Anonymous Coward

    I'm "lucky." There is no "rare" bug in a system that I've never encountered while looking after it. Rare/unlikely seem to mean "guaranteed to happen on my shift." :)

  6. DS999 Silver badge

    That seems way too convenient

    I think it is a "Who, Me", but one who thought he'd be smart by shortening the change window by doing some "harmless" pre-work to prepare, and it turned out whoever designed the plan knew more about how things work and had already shortened the in-window work as much as possible.

    Over my years of consulting I can think of more than a few times I created carefully scripted changes that were to be executed by on call or offshore staff overnight or on a weekend, who screwed things up because they went off script. And I've seen it happen when I wasn't directly involved more times than I can count. A few cases were like this, trying to do steps they thought were harmless before the window to complete the change more quickly.

    There are always those people who think they are more clever than they are, and want to complete something planned for x time in x/2 or x/4 time so they take shortcuts from planned/established procedure thinking they will be rewarded with praise and attention. Well, they get the attention at least!

    This guy was better than most because he was somehow able to deflect blame to the "rare race condition happened a half hour before we were going to make the change"!

  7. Warm Braw

    Cloud-based load balancers

    Do I get a prize?

    Actually, I'm presuming not. The malfunctioning load balancers were a symptom rather than a diagnosis.

  8. Anonymous Coward

    What could possibly go wrong?

    Jeremy Clarkson strikes again. With predictable results.

  9. Anonymous Coward

    The Cloud...

    Other people's computers you have no control over

    1. Richard Pennington 1

      Re: The Cloud...

      I thought the definition of Cloud computing was subcontracting your security, availability, privacy and integrity to somebody else.

  10. fredesmite2



    using someone else's computer system .. thinking they care about it as much as you do.

  11. Anonymous Coward

    * Terrible IT Software Undermines Purchasing

    I always preferred 'Total Inability To Support Usual Performance' - it even seems more apt in this instance.

    1. Excellentsword (Written by Reg staff)

      We try to make it more fun
