Google Cloud caused outage by ignoring its usual code quality protections

Google Cloud has explained the massive outage it created last week and, as has happened many times previously, admitted that it broke itself. The outage struck last Thursday and meant that Google Cloud customers could not access their rented infrastructure for at least three hours. Among the customers impacted by the event was …

  1. abend0c4 Silver badge

    Never mind the quality, feel the bandwidth

    So not only was there an untested code path in the code that was deployed, but the code that configures policy changes permits "unintended blank fields"?
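For illustration, the kind of write-time check the comment above is asking for might look like the following. This is only a sketch under stated assumptions: the `Policy` struct, its field names, and `ValidatePolicy` are hypothetical and do not reflect Google's actual schema or tooling.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Policy is a hypothetical policy record; the field names are illustrative only.
type Policy struct {
	Name      string
	Region    string
	QuotaUnit string
}

// ValidatePolicy rejects a policy if any required field is blank, so a bad
// record is refused when it is written instead of reaching its consumers.
func ValidatePolicy(p Policy) error {
	var missing []string
	if strings.TrimSpace(p.Name) == "" {
		missing = append(missing, "Name")
	}
	if strings.TrimSpace(p.Region) == "" {
		missing = append(missing, "Region")
	}
	if strings.TrimSpace(p.QuotaUnit) == "" {
		missing = append(missing, "QuotaUnit")
	}
	if len(missing) > 0 {
		return errors.New("blank required fields: " + strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// A policy with two fields left blank is refused rather than stored.
	bad := Policy{Name: "quota-policy"}
	if err := ValidatePolicy(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The point of the design is that the writer of the config fails loudly, rather than every reader having to cope with a half-filled record.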

  2. Anonymous Coward

    Null pointer brings down Google Cloud

    Good thing Google have those famously rigorous interviews to prevent this happening... Oh erm wait...

    1. Ace2 Silver badge

      Re: Null pointer brings down Google Cloud

      Yeah, and million-dollar comp packages for all of their superstars

  3. Anonymous Coward

    "the null pointer caused the binary to crash."

    Coded in C then? :)

    1. Crypto Monad

      Re: "the null pointer caused the binary to crash."

      I would guess in Go (which has the same issue with null pointers as C, and many other languages)

      1. Richard 12 Silver badge

        Re: "the null pointer caused the binary to crash."

        Rust would do the same, it's an invariant.

        C# and C++ would throw an exception that may or may not be caught - realistically, that also means crash in most cases.

        "Memory safety" generally means "won't continue running after corruption occurs". So yes, crash out quickly.
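To make the Go case in this sub-thread concrete, here is a minimal sketch, assuming (as guessed above, not confirmed) that the binary was written in Go. The `Policy` and `Quota` types and all function names are hypothetical. It shows an unchecked dereference of a nil pointer panicking at run time, next to a guarded version that returns an error instead.

```go
package main

import (
	"errors"
	"fmt"
)

// Quota and Policy are hypothetical types used only for illustration.
type Quota struct{ Limit int }

type Policy struct {
	Quota *Quota // nil when the field was left blank
}

// LimitUnchecked dereferences Quota without checking it;
// a nil Quota causes a run-time panic.
func LimitUnchecked(p Policy) int {
	return p.Quota.Limit
}

// LimitChecked returns an error instead of panicking when Quota is missing.
func LimitChecked(p Policy) (int, error) {
	if p.Quota == nil {
		return 0, errors.New("policy has no quota")
	}
	return p.Quota.Limit, nil
}

// Panics reports whether f panicked. It exists only so the demo can show
// the crash without taking the whole program down.
func Panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	blank := Policy{} // Quota is nil
	fmt.Println("unchecked panics:", Panics(func() { LimitUnchecked(blank) }))
	_, err := LimitChecked(blank)
	fmt.Println("checked error:", err)
}
```

Without the `recover` wrapper, the unchecked call would terminate the process, which is the "crash out quickly" behaviour described above.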

        1. fg_swe Silver badge

          Fail Early, Stop Process

          A null pointer which immediately stops a process is much better than the tristates of C and C++ such as "use after free" and "contains random number, because it resides on stack".

          https://di-fg.de/discussion.html section D11

  4. Mishak Silver badge

    "but the code path that failed was never exercised during this rollout"

    Rollout? WTF? Surely they have to demonstrate 100% code / branch coverage before letting code out of development?

    1. Jearil

      Re: "but the code path that failed was never exercised during this rollout"

      100% code coverage on all code is unrealistic and generally not worth the expense. Obviously this should have been tested. More importantly it should have had a flag that could be simply turned off to mitigate the problem in minutes rather than hours.

      Planning for failure is much better than assuming you can prevent all failures from ever occurring. Of course you should test as much as possible, but also assume it'll fail anyway and have a plan around that.
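A flag of the kind described above can be sketched in a few lines. This is an illustrative assumption only: the flag name `newQuotaCheckEnabled` and the functions `legacyCheck`, `newCheck`, and `Check` are hypothetical, and a real system would load the flag from a config service rather than set it in process.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// newQuotaCheckEnabled is a hypothetical kill switch for the new code path.
// Its zero value is false, so the new path is off until someone enables it.
var newQuotaCheckEnabled atomic.Bool

// legacyCheck stands in for the existing, well-exercised code path.
func legacyCheck(request string) string { return "legacy:" + request }

// newCheck stands in for the newly deployed code path.
func newCheck(request string) string { return "new:" + request }

// Check routes to the new path only while the flag is on.
func Check(request string) string {
	if newQuotaCheckEnabled.Load() {
		return newCheck(request)
	}
	return legacyCheck(request)
}

func main() {
	fmt.Println(Check("r1")) // flag is off by default

	newQuotaCheckEnabled.Store(true) // gradual enablement would happen here
	fmt.Println(Check("r1"))

	newQuotaCheckEnabled.Store(false) // mitigation: flip the switch, no redeploy
	fmt.Println(Check("r1"))
}
```

Because the old path stays in the binary, turning the flag off is a configuration change rather than a rollback of the deployment.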

      1. Mishak Silver badge

        "100% code coverage on all code is unrealistic and generally not worth the expense"

        Code that is not covered is code that is not tested. If it is not possible to achieve coverage, then one of the following must hold:

        1) The test vectors are inadequate; or

        2) The code has not been designed to be testable; or

        3) The code is unreachable / on an infeasible path and should be removed.

        100% code coverage is required for critical systems (including medical, automotive, avionics, ... ) to ensure that "surprises" do not happen in service. Less than 100% may be acceptable for a non-critical system.

        100% coverage does not need to be achieved for a library used within a project, but it does for the test environment used to validate that library.

        Edited to cover "expense".

      2. herman Silver badge

        Re: "but the code path that failed was never exercised during this rollout"

        So, would you prefer to fly in the Boeing with untested code paths, or the Airbus with fully tested code?

    2. Mishak Silver badge

      Really?

      I'm guessing by the "thumbs down" that people don't think code and branch coverage are useful for mission-critical code?

      It can be a right PITA to achieve them - but only if the code has not been designed for test...

      1. kmorwath

        Re: Really?

        This should also be flagged by static code analysis - if you de-reference a pointer without testing if it is null/nil or not.

  5. An_Old_Dog Silver badge
    Joke

    Google Improvements

    "We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues"

    "Reacting to an issue": "Hey, our SPOF - Google services - just failed. We're fucked 'til they fix it. So put a blurb on the answering machines, and it's beer o' clock for us!"

    1. ChoHag Silver badge
      Pint

      Re: Google Improvements

      I'm sorry I don't get it. How is that a joke? That's just how business operates now.

    2. Someone Else Silver badge

      Re: Google Improvements

      Oh, and don't forget a healthy (?) helping of technobabble gobbledygook. Pepper the excuse explanation with enough officious-sounding gibberish so that the hoi polloi are left with the impression that this stuff is Really Hard™, so as to engender sympathy for the poor beleaguered tech entity that roundly fucked up by taking a shortcut and not testing their shit.

  6. Anonymous Coward

    Cut out the middleman

    Why test in prod when you can just forgo the testing completely? Good stuff.

  7. trackerbelowground

    NEWSFLASH: Google has an [ignored] quality program!

  8. Trank1234

    Oh weird. Who knew QA was important for code development?

  9. munnoch Silver badge

    "Google has promised to stop repeating the mistakes that led to this outage"

    And instead try different mistakes...

    1. Richard 12 Silver badge

      Re: "Google has promised to stop repeating the mistakes that led to this outage"

      One should always try to make new and exciting mistakes!

      1. Claptrap314 Silver badge

        Re: "Google has promised to stop repeating the mistakes that led to this outage"

        Much better than repeating the same old ones...

      2. Someone Else Silver badge

        Re: "Google has promised to stop repeating the mistakes that led to this outage"

        Move fast and break (new) things....

  10. Claptrap314 Silver badge
    WTF?

    Uggh

    One of the things that surprised me when I was an SRE there (almost a decade ago) was how few checks they had for their config updates--and what there were, were hacked together, not the work of a SWE. So, I'm not super-surprised about the unchecked blank fields.

    Not using a feature flag? !!!???!!! Google wrote the book on feature flags! That's incredibly sloppy.

    Never mind adding an untested code path to prod--this is also a deep failure.

    Someone is making bad life choices.

  11. Knightlie

    What About Fixing The Code?

    I notice their list of mitigations doesn't include "Coding properly." No feature switch, no unit tests of bad data - sounds like it was vibe coded by a fake developer.

    1. herman Silver badge

      Re: What About Fixing The Code?

      It is the AI code generator’s fault.
