Google reveals version control plus not expecting zero as a value caused Gmail to take an inconvenient early holiday

Google has revealed the cause of its very unwelcome Gmail outage and on The Register’s reading of the situation it boils down to forgetting to take an obsolete version of software out of production. Google’s shorter explanation for the mess is: “Credential issuance and account metadata lookups for all Google user accounts …

  1. heyrick Silver badge

    code and infrastructure are complicated enough that the time since September 10th wasn’t enough to do the whole job

    Or maybe they were just hoping it would go away, followed by a mad rush as oh crap we really have to obey somebody else's laws...?

    1. LDS Silver badge

      Deliberate delay to turn users against the EU

      It's another propaganda tactic employed by Facebook. "If you miss features, it's because of those Grinches from the EU. Santa Zuck was ready to bring you candies, but the EU blocked that"

    2. Strahd Ivarius Silver badge
      Trollface

      But... but...

      Is Facebook run by BJ team?

  2. Howard Sway

    In order to comply with the law, we needed to adjust the way our services work

    In other words, what we are doing is so blatantly shitty, that countries are now having to pass laws to stop us.

  3. Elledan Silver badge
    IT Angle

    That's what a staging setup is for

    If one wants to do critical infrastructure right, all new code and configurations have to make their way from a testing setup to a staging setup before getting even close to production. Deployment is then a simple matter of copying the tested-and-verified configuration on staging to production. For this to work without surprises, the staging and production environments have to be identical.
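    A minimal sketch of the "copy exactly what staging verified" step, assuming configurations can be represented as plain dictionaries. The names `config_fingerprint`, `can_promote` and the sample settings are hypothetical, not anything taken from Google's tooling.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 fingerprint of a configuration mapping."""
    # sort_keys makes the serialisation independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_promote(verified_on_staging: dict, about_to_deploy: dict) -> bool:
    """Allow promotion only if the content matches what staging verified."""
    return config_fingerprint(verified_on_staging) == config_fingerprint(about_to_deploy)


# Hypothetical settings, for illustration only.
staging = {"quota_service": "v2", "replicas": 3}
candidate_same = {"replicas": 3, "quota_service": "v2"}   # same content, different key order
candidate_drift = {"quota_service": "v1", "replicas": 3}  # an old version slipped in

print(can_promote(staging, candidate_same))   # True
print(can_promote(staging, candidate_drift))  # False
```

    A check like this only catches drift between what was tested and what is deployed; it says nothing about whether the tests on staging were adequate.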

    It sounds more like Google practices the time-honoured tradition of 'production is staging', however. To have an old version of the software lying around suggests that they are not using an automatic deployment script (unless it's an intern called 'Gary') and instead probably have a haphazard semi-automatic (or even manual) procedure for updating production.

    In my experience of working for large companies, this behaviour is not unusual, though. Most places I have worked for never had a 'testing' environment, instead using 'staging' as the testing environment and production for staging. It does increase code throughput and makes it seem like everything is moving faster instead of being stuck in 'staging' for weeks while issues are discovered and fixed. The trade-off with omitting staging is the influx of tickets and angry phone calls in the hours and days after deployment to production, of course.

    Stuff that slipped past unit tests and local testing would end up in production and explode in spectacular fashion, to the point of devices rebooting (watchdog timer) and functionality suddenly being broken due to environment detection gone wrong or similarly simple issues.

    Omitting a staging phase in deployment is like omitting the 'are you sure?' dialogue box before a disk-erasing operation. Better keep those backups updated (and tested).

    1. RockBurner

      Re: That's what a staging setup is for

      Gregarius

      Automated

      Reactionary

      Youngster

      ?

    2. My other car WAS an IAV Stryker Silver badge
      Thumb Up

      Re: That's what a staging setup is for

      In the armored vehicles biz, "staging" is known as the prototype(s). It's ours, the customer can see it but can't have it, and we test all fixes on it. We keep it around for experiments and further testing like cyber and fault induction/detection.

      Then, right off the production line(s), the first # vehicles are labelled "initial" and go to the customer's test sites. Only after that do production lines really get going making vehicles for actual field use.

      My current work is more on Army Watercraft -- ships that can move these vehicles (many from my former employer) around the globe. There's not as many ships, and we're only installing small systems -- not like a full ship overhaul -- so it's 1) testing (on land), 2) "staging" is called "fit check" and 3) "production" is called "installation".

      No matter the project/scope, anyone who skips the middle step(s) is going to regret it when they get to the final step.

      Engineering is the same way: derive requirements, design/develop, analyze if requirements are met and any design holes (e.g. vulnerabilities) exist BEFORE moving to test. Don't skip the analysis!

      1. Strahd Ivarius Silver badge
        Joke

        Re: That's what a staging setup is for

        This is not Agile!

        And that explains why your product may be a huge success compared to F-35...

  4. Mike 137 Silver badge

    Yet again - zero bounds checking

    Nobody seems to test software these days for boundary conditions, despite it being fundamental to creation of robust code. In order to provide adequate assurance, testing should not only check the results of what should happen, but also the results of what should not.
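    As a rough illustration of testing both sides of a boundary, here is a small sketch. The helper `items_per_page` and its rules are invented for the example; the point is only that the tests cover what must be rejected as well as what must succeed.

```python
def items_per_page(total_items: int, pages: int) -> int:
    """Hypothetical helper: how many items go on each page (rounded up)."""
    if total_items < 0:
        raise ValueError("total_items must not be negative")
    if pages <= 0:
        raise ValueError("pages must be at least 1")
    return -(-total_items // pages)  # ceiling division


def rejects(func, *args) -> bool:
    """Return True if func(*args) raises ValueError, False if it returns normally."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


# What should happen.
assert items_per_page(10, 3) == 4
assert items_per_page(0, 1) == 0        # lower boundary of the valid range

# What should not happen.
assert rejects(items_per_page, 10, 0)   # zero pages: a division by zero if unchecked
assert rejects(items_per_page, -1, 1)   # just below the valid range
assert rejects(items_per_page, 10, -1)
```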

    1. HildyJ Silver badge
      Facepalm

      Re: Yet again - zero bounds checking

      It may be an old geezer adage, but the current generation of programmers needs to stop ignoring it:

      Garbage In, Garbage Out.

    2. HereIAmJH

      Re: Yet again - zero bounds checking

      Technically, this would be range checking: making sure the values you are manipulating are within an expected range. In this case, > 0. A pretty common occurrence is list handling. Users shouldn't ever see 'subscript out of range', for example. But coding practices have largely become "let the shit fall where it may, and we'll address it in a future release".

      Bounds checking, OTOH, or the lack of it, is the more dangerous incompetency. This is where no one is checking if the data will fit in the memory allocated. And leads to buffer overflows and exploits.

      Delphi had both bounds checking and range checking since the 90s. Other languages don't. Either because someone sees them as syntactic sugar, or because it slows down their app. Seriously, there is enough other crap slowing down apps much more than bounds checking. It's time for some improved tools.
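      The distinction drawn above can be sketched roughly as follows. Both helper names are hypothetical. Python is used only for illustration, and it is a slightly unusual case: slice assignment on a `bytearray` can silently resize the buffer instead of overflowing, so the explicit check stands in for what an unchecked language would get wrong.

```python
def check_range(value: int, low: int, high: int) -> int:
    """Range check: is the value itself one we expect (low <= value <= high)?"""
    if not low <= value <= high:
        raise ValueError(f"value {value} outside expected range {low}..{high}")
    return value


def write_into(buffer: bytearray, offset: int, data: bytes) -> None:
    """Bounds check: will the data fit in the memory already allocated?"""
    if offset < 0 or offset + len(data) > len(buffer):
        raise IndexError(
            f"{len(data)} bytes at offset {offset} do not fit in a buffer of {len(buffer)}"
        )
    buffer[offset:offset + len(data)] = data


buf = bytearray(8)
write_into(buf, 4, b"abcd")   # exactly fits in the last four bytes
check_range(1, 1, 100)        # a value > 0, as in the case discussed

try:
    write_into(buf, 6, b"abcd")   # would run two bytes past the end
except IndexError as exc:
    print("bounds check failed:", exc)

try:
    check_range(0, 1, 100)        # zero is not an acceptable value here
except ValueError as exc:
    print("range check failed:", exc)
```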

    3. Nick Ryan Silver badge

      Re: Yet again - zero bounds checking

      Without wanting to touch on the idiotic near-religious flame war that is exception vs error handling... the default approach, because it's lazy as hell, is to do no error handling at all and just let exceptions handle everything. Doesn't matter if it's a relatively expected error, it will be left to raise a cascade of meaningless exceptions until some unfortunate component in the stack-o-horrors traps the exception and will duly mask it with the ever helpful "an error has occurred" message and then fail.
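      Whichever side of that argument one takes, the alternative to "an error has occurred" might look something like this sketch: anticipate the failures you can, and report what was being attempted and with which input. `QuotaLookupError` and `parse_quota` are made-up names for the example.

```python
class QuotaLookupError(Exception):
    """Raised when a quota cannot be determined for an account."""


def parse_quota(raw: str, account: str) -> int:
    """Parse a quota value, handling the failures we can anticipate explicitly."""
    try:
        quota = int(raw)
    except ValueError as exc:
        # Expected failure: keep the original cause and say what was attempted.
        raise QuotaLookupError(
            f"quota for {account!r} is not a number: {raw!r}"
        ) from exc
    if quota <= 0:
        raise QuotaLookupError(
            f"quota for {account!r} must be positive, got {quota}"
        )
    return quota


try:
    parse_quota("abc", "alice")
except QuotaLookupError as exc:
    print(exc)  # quota for 'alice' is not a number: 'abc'
```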

    4. Mike 16 Silver badge

      Re: Yet again - zero bounds checking

      There's a reason for that.

      Once upon a time I needed to deal with files generated by a very popular sound-editing application. The file format was very well documented. I (and a "second witness") very carefully checked that my code matched the spec. Unfortunately, it immediately started reporting errors in nearly every file that had passed through that application (and only that app, not, e.g. files written by the authors of the standard). A few minutes with a hex editor revealed the problem: the very popular app did not follow the standard (basically, it started an index at zero, rather than the specified 1).

      So I had to change my code to match what the app did, rather than what the spec wanted. It's not like I could force the app's developers to meet the spec, and my customers were unlikely to value my compatibility with spec over their use of the app.

      I suspect a lot of this goes into the decision to elide error checking. Much like when one could choose to follow the (IIRC) ESMTP RFCs or interoperate with Exchange.

      1. heyrick Silver badge

        Re: Yet again - zero bounds checking

        I think there's a subtle difference in "being flexible in what you accept" and "accepting anything without bothering to check it makes sense".
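        One way to picture that difference, using the off-by-one index from the story above: accept the one deviation you know about, explicitly, and still reject values that make no sense under either convention. The function name and the `writer_is_zero_based` flag are assumptions for the sketch, not part of any real file format.

```python
def normalise_index(raw: int, count: int, writer_is_zero_based: bool) -> int:
    """Map a file's index field to a 0-based position, or raise ValueError.

    The spec (in this hypothetical) counts from 1; one known writer counts from 0.
    """
    position = raw if writer_is_zero_based else raw - 1
    if not 0 <= position < count:
        raise ValueError(f"index {raw} does not refer to any of {count} entries")
    return position


print(normalise_index(1, 3, writer_is_zero_based=False))  # 0: spec-compliant file
print(normalise_index(0, 3, writer_is_zero_based=True))   # 0: the known deviation

try:
    normalise_index(3, 3, writer_is_zero_based=True)       # nonsense either way
except ValueError as exc:
    print("rejected:", exc)
```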

    5. 's water music

      Re: Yet again - zero bounds checking

      >> testing should not only check the results of what should happen, but also the results of what should not.

      Nonsense, ...weil nicht sein kann, was nicht sein darf...

  5. David Roberts Silver badge
    Windows

    Paxos?

    No wonder they were stuffed.

  6. Ross 12

    Facebook

    The thing about Facebook is that because of their size and dominance, people assume they're professionals and know what they're doing. But I get the distinct impression that they're mostly staffed by excitable but amateur coders who think they're pioneers and treat everything as an opportunity to 'do something cool', which amounts to badly reinventing the wheel. The bugs and weird behaviours that occur in Facebook's mobile app and web front-end often suggest that the architecture is an unholy un-tamable mess.

    1. Nick Ryan Silver badge

      Re: Facebook

      The bugs and weird behaviours that occur in Facebook's mobile app and web front-end often suggest that the architecture is an unholy un-tamable mess

      Well... this was the company that confused the relatively simple job of using HTML5 with creating an unwieldy unusable JavaScript driven horror shit show and eventually had to vomit up a semi-cross platform native app. Which they paid mobile phone manufacturers to embed on phones in a largely non-removable way.

      Creating unwieldy unusable JavaScript driven horror shit shows is what the "smart kids" are pretending are web apps these days...

  7. Anonymous Coward
    Anonymous Coward

    Ahh. Unit testing. Functional testing. User Acceptance testing. All phases of deployment most companies don't even bother to pay lip service to any more.

    Had it fairly recently.... the conversation went along the lines of me asking about testing, some mumbling from the developers about 'kind of' testing it so it *should* work, at which point I asked "If that's the case, why am I on a call at 23:00 and we're discussing your code not running properly because you haven't performance tested it against a production sized dataset".

    1. krakead
      FAIL

      The corollary to this is the client insisting that the updates go live NOW despite having not been tested.

      1. Martin Gregorie Silver badge

        The corollary to this is the client insisting that the updates go live NOW despite having not been tested.

        That's easy: insist that they sign a note saying that they requested that untested code be run live and that they bear all responsibility for any problems arising.

        No signed request: no untested code gets run.

        That should make even the stupidest PHB reconsider whether this is really a good idea.

        1. Anonymous Coward
          Anonymous Coward

          I had a client call me at 8am one morning asking for updates he'd thought of overnight for his web platform. I told him we could do the updates but that we wouldn't have time to test them if they needed to go live immediately. He said "just do it," so we did.

          A little while later I got a rather angry phone call after he had a rather bad meeting with an investor where his site didn't work properly in Firefox. I told him we were still testing for issues and it was his decision to put the changes live before full testing. He told me he'd sue me and make sure nobody ever worked with me again (I was running a small development company at the time).

          I offered to continue testing and fix the issues but he wanted me to stop all work and hand over all assets, which I did.

          Not long after I received court papers which were from a company we'd never worked for. I looked it up and the company didn't even exist. I got the case thrown out before court without ever involving a solicitor and never heard from him again. I only lost a few hundred on unpaid invoices and walked away with the determination that I would never put anything live again - no matter the customer demands - without full testing.

          I also got a quiet chuckle from the fact the customer was a former law enforcement officer who couldn't complete a simple court document correctly - or perhaps was doing so fraudulently. I also consider that I got away lightly for dropping my standards.

          Anon, because, well, it wasn't my finest moment.

    2. TimMaher Silver badge
      Facepalm

      UAT

      And the other thing about UAT is that you have a chance to discover what the actual users were hoping for rather than the crap specced by a BA on the instruction of manglement.

      1. Nick Ryan Silver badge

        Re: UAT

        Given such a remit the actual users will specify that they want a text input box to be a bit bigger and for the colours to be tweaked. Users rarely specify what they need; this usually requires someone with cross-discipline skills to specify. Often, of course, the management of said users will specify exactly how they want something to work, platform and all, rather than what they need the system to do. Usually because they don't really know.

        1. Terry 6 Silver badge

          Re: UAT

          There's a lot in this. Users can specify what they do- how they do it and why they need to do it. Management can specify what the outcomes are meant to be. It takes both of these to develop a software tool. Developers seem (in my experience of working with some of them) to produce what they think users ought to want , based on managers' explanation of what they think the users do. The latter is seldom a full story. The former a mere fantasy.

          1. Getmo

            Re: UAT

            When I was doing internal development creating an inventory system, I realized I needed to get input from practically every class of employee just to get the full picture. First you talk with the stock girl, because she actually knows the current inventory system the best. Then of course the shop techs, who actually need to use the system to order more parts, and to track remaining quantities needed for their future delivery deadlines. They might give you unique ideas, like an option to put a part on hold that's still being ordered, so they're the first to get it. Then of course management, they can be clueless but if you offer up simple features like, "what about a cash flow report, that shows the cash value of all parts flowing into and out of the inventory that month?" it helps them feel included, 'special'. And the CEO, sometimes they may actually have a simple, good idea, like getting that same cost report the other managers get, but produced weekly, and automatically emailed to him Monday morning.

            Then after merging all your notes from these different people, you may have enough info to finally start designing this system on paper! (I couldn't imagine what it's like doing that from the external perspective.)

    3. Anonymous Coward
      Anonymous Coward

      re: most companies don't even bother to pay lip service to any more

      I've worked for a lot of companies in my career, and I've never found one that didn't at least pay lip service to testing.

      Sounds like it's your employment finding skills that are the problem.

    4. tiggity Silver badge

      Recently had a job interview - I was asking them about their testing processes in fair detail - turned out they were minimal - did not get the job, as they made clear testing was a low priority compared to getting code out the door

  8. SecretSonOfHG

    Have all of you misunderstood/read the Google report?

    This is an update to an authentication service, not an end user facing application. So there is little use for any "staging" or "validation" environment for anyone to test. The engineers are more than able to validate the change by themselves during service testing and load/fail over testing.

    And in case you've not read the article, the failure was due to leaving the old quota system in place instead of switching it off, and that happened back in October. Yes, that was a time bomb, but when doing this kind of infrastructure upgrade, leaving the legacy system switched on instead of turning it off is best practice, as it speeds up the rollback in case you need it. The only fault on Google's side is that they forgot to switch the old system off after more than a month of running the new one, and that caused the outage.

    It is just impossible to detect that kind of omission in any "staging" or "testing" or "pre-anything" environment. The only purpose of having the legacy system running in these environments is to test that you can roll back your changes quickly without breaking it. Unless you leave such an environment running for more than a month injecting a realistic workload, you'll never, ever catch the omission. Not that it makes much sense to spend two months running a simulated production environment, as you only have to make a note of switching off the legacy system once everything is working smoothly with the new one. And remember to do it. Which wasn't the case.
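    The "make a note and remember to do it" part is the sort of thing that can be automated. A minimal sketch, assuming each legacy system kept for rollback is recorded with an agreed switch-off date; the class, the system names and the dates are all invented for illustration and do not describe Google's setup.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LegacySystem:
    name: str
    kept_for_rollback_until: date  # agreed date after which it must be switched off
    still_running: bool


def overdue_for_shutdown(systems: list[LegacySystem], today: date) -> list[str]:
    """Names of legacy systems still running past their agreed rollback window."""
    return [
        s.name
        for s in systems
        if s.still_running and today > s.kept_for_rollback_until
    ]


# Hypothetical inventory.
inventory = [
    LegacySystem("old-quota-service", date(2020, 1, 31), still_running=True),
    LegacySystem("old-mailer", date(2020, 1, 31), still_running=False),
    LegacySystem("old-auth-cache", date(2020, 6, 30), still_running=True),
]

print(overdue_for_shutdown(inventory, today=date(2020, 3, 1)))  # ['old-quota-service']
```

    Run from a scheduled job, a check like this turns a forgotten note into an alert, which is cheaper than keeping a simulated production environment alive for two months.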

  9. scrubber

    Old school

    Maybe it's time to put the beta label back on it?

    1. Jakester

      Re: Old school

      Sounds more alpha-ish.

  10. Tom Paine
    Trollface

    Come on...

    We've all been there.

  11. VTAMguy

    Facebook developers

    I have no problems at all with Facebook developers being forced to miss Christmas. I would like to see it. I have a long list of other punishments I would like to see them subjected to as well. I would also like to see the entire company fail but I guess I should just stick to small dreams.

  12. This post has been deleted by its author

  13. Stuart Halliday
    IT Angle

    Ask how Boeing do their testing, then don't do it that way...


Biting the hand that feeds IT © 1998–2021