Google may have taken this whole 'serverless' thing too far: Outage caused by bandwidth-killing config blunder

Google says its four-hour wobble across America and some other parts of the world on Sunday was caused by a bungled reconfiguration of its servers. The multi-hour outage affected Google services including Gmail, YouTube, Drive, and Cloud virtual-machine hosting, and knocked out apps like Uber and Snapchat that rely on the web …

  1. Anonymous Coward
    Anonymous Coward

    The Cloud...

Other people's computers you have no control over.

    1. Anonymous Coward
      Anonymous Coward

      Re: The Cloud...

      Or in this case

      Other people’s computers THEY have no control over.

    2. stiine Silver badge

      Re: The Cloud...

The wonderful world of wildcards in scripts and limited test environments (especially the latter).

    3. veti Silver badge

      Re: The Cloud...

      Only a sysadmin could write that.

      To a user, local storage is just as much "other people's computers that they have no control over".

    4. Patrician

      Re: The Cloud...

The whole internet is made up of "other people's computers you have no control over".

  2. Jay Lenovo
    Terminator

    Lemming Clouds

    Dear Cloud Minions,

    Proceed to cut off your ears and then listen for my next command.

    Automation at its worst.

  3. Adrian 4

Someone tripped over that cable, didn't they?

    https://www.xkcd.com/908/

  4. Anonymous Coward
    Facepalm

    Whatever happened to distributed computing?

Whatever happened to distributed computing, so that there is no single point of failure? This was an accident; what happens if there is a major Internet failure in one of these regions and the rest try to reroute and choke on it? The whole point of moving to the Cloud was to provide redundancy. If a single config gone rogue can do this, it doesn't say much for the quality of the fundamental architecture. Dynamically moving virtual machines around while they're still running, doing live updates, etc. I'm a skeptic, if you can't tell already.

    1. Phil Kingston

      Re: Whatever happened to distributed computing?

      >The whole point of moving to the Cloud was to provide redundancy

      It's one of the points, but not the entire point.

      1. Ken Moorhouse Silver badge

        Re: The whole point of moving to the Cloud was to provide redundancy

        ...to people whose jobs it displaces.

        ...and also to the people recommending it where problems like this continue to recur.

    2. eldakka

      Re: Whatever happened to distributed computing?

      This is distributed computing.

      There is always a single point of failure - the administrators.

      In this case, one administrator (or set of administrators) sent a configuration to all servers.

      This is a failure in their administration/configuration procedures and/or interface usability options. E.g., from a multi-select list of servers/zones/clusters, they selected the "All Zones" option rather than the "cluster XYZ" option.

This sort of thing can be mitigated by better interface design. E.g. not being able to send configuration to all zones at once, or having to use different accounts for different zones. Sure, there may be an "all zones" account, but you'd only log in to that account if you were deliberately sending config to all zones, and the account they should have been using should not have had the ability to push configuration to every zone. Or there could be separate administration interfaces per zone: to administer zone 1, the admin has to connect to the zone 1 management console/server, which can't send configuration commands to any zone other than zone 1; for zone 2, connect to the zone 2 management service; for all zones, connect to the all-zones management service with a different set of credentials (a sensible, conscious "are you sure you really want to update all zones" step rather than just a yes/no dialogue box).
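      For illustration, a minimal sketch of that kind of guard rail, assuming a hypothetical in-house push tool (the credential scheme, zone names and function are all invented for the example): per-zone credentials can only touch their own zone, and the all-zones credential forces a conscious confirmation step.

      ```python
      # Hypothetical sketch of a zone-scoped config push guard. The tool,
      # credential scheme and zone names are invented for illustration only.

      ALL_ZONES = ["us-east1", "us-west1", "europe-west1", "asia-east1"]

      class BlastRadiusError(Exception):
          """Raised when a rollout targets more zones than the credential allows."""

      def push_config(config: dict, zones: list[str], credential: str) -> None:
          # A per-zone credential ("zone-admin/us-east1") may only touch its own zone.
          if credential.startswith("zone-admin/"):
              allowed = credential.split("/", 1)[1]
              if zones != [allowed]:
                  raise BlastRadiusError(
                      f"{credential!r} may only push to {allowed!r}, not {zones!r}")
          # The all-zones credential exists, but a multi-zone push still requires a
          # conscious confirmation step rather than a reflexive yes/no dialogue.
          elif credential == "global-admin":
              if len(zones) > 1:
                  answer = input(f"Type the number of zones you intend to modify ({len(zones)}): ")
                  if answer.strip() != str(len(zones)):
                      raise BlastRadiusError("multi-zone push not confirmed; aborting")
          else:
              raise BlastRadiusError(f"unknown credential {credential!r}")

          for zone in zones:
              # A real tool would talk to that zone's management API here.
              print(f"pushing config to {zone}")

      # push_config({"qos": "new-policy"}, ["us-east1"], "zone-admin/us-east1")
      # push_config({"qos": "new-policy"}, ALL_ZONES, "global-admin")  # forces confirmation
      ```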

      1. Anonymous Coward
        Anonymous Coward

        Re: Whatever happened to distributed computing?

Your way of reasoning accepts an abundance of incompetence usually thought to be the sole preserve of manglement. Are you a mangler?

        1. Anonymous Coward
          Anonymous Coward

          Re: Whatever happened to distributed computing?

          Was it a mangler who made the critical mistake in this case?

      2. Anonymous Coward
        Anonymous Coward

        Re: Whatever happened to distributed computing?

        Indeed, your point is not only valid, but highlights the vital principle that all difficult problems of computing involve human beings. Service and maintenance, human interfaces, estimation, requirements specification... all those nasty areas involve human beings relying on their vastly inferior memories, knowledge, judgment and consistency.

This parallels the wise remark I heard when I first contemplated joining the IT (then known more primitively as "computer") industry back in 1971.

        The guru to whom my father introduced me over pints in a suitable hostelry said reflectively, "There can be no doubt that computing will be a very important growth industry for at least the next half century, and probably a lot longer than that. There will be many lucrative opportunities for the shrewd".

        I then asked which particular career paths he would recommend, within the computer industry.

        "Oh," he replied, "there are several. Management, accounting, sales, personnel..."

      3. Anonymous Coward
        Anonymous Coward

        Re: Whatever happened to distributed computing?

In the age of the "cloud", additional approval workflows must be introduced, because minions wear blinkers most of the time.

      4. Claptrap314 Silver badge

        Re: Whatever happened to distributed computing?

        I worked as a Google SRE. Sorry, but no.

        At any given time, my services were in about 20 DCs. That's not what we could be in, it was what we were actually in. The limit was the toil--the time required just to keep up with reserving space, initiating systems, managing traffic, and the like. From a pure design perspective, we would want to be in about twice as many data centers, as it would allow us to be in a state where we more or less always had one primary DC (and maybe three secondaries) in maintenance. This means that we would minimize excess capacity. But we almost never talked about doing that because it was just too much toil.

        The key here is toil--time spent doing things by hand that don't involve long-term improvements of the system. Time spent changing logins is toil--and would be immediately automated out of effect.

        Yes, sending out a change to the wrong set of DCs is a BAD THING. And like Sloss said, they are going to spend engineering effort (I would guess close to a man-year) figuring out how to minimize the likelihood of a repeat. Implementation will likely be a few times that.

        But usually, such a change would be rolled back in 20-30 minutes, with traffic clearing up in five or less. Speculating slightly, it sounds to me like a big problem is that their traffic prioritization algorithms turned into a primary culprit in the outage. Specifically, their debug and perhaps even configuration change packets were not being prioritized as high priority, and so might have been dropped.

The other thing is that I know that, as of a few years ago, the way they were handling traffic prioritization was relatively primitive. I did not have a chance to track down the right people to discuss what they were getting wrong and how to fix it, but one of the results was that prioritization was not absolute.
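        For readers who haven't met the distinction, the toy sketch below (not a model of Google's scheduler; all numbers invented) shows why "not absolute" matters: under strict priority a small control-plane queue always gets through, while under weighted sharing it can be squeezed once bulk traffic floods the link.

        ```python
        # Toy comparison of strict vs. weighted priority on a congested link.
        # Purely illustrative; not a model of Google's actual traffic management.

        def serve(queues, capacity, weights=None):
            """Return how many packets of each class get through the link."""
            served = {}
            if weights is None:
                # Strict priority: drain the highest-priority (lowest number) queue first.
                for name in sorted(queues, key=lambda n: queues[n]["prio"]):
                    served[name] = min(queues[name]["backlog"], capacity)
                    capacity -= served[name]
            else:
                # Weighted sharing: every class gets a slice proportional to its weight,
                # so high-priority traffic can still be squeezed when the link is full.
                total = sum(weights.values())
                for name, w in weights.items():
                    served[name] = min(queues[name]["backlog"], capacity * w // total)
            return served

        queues = {
            "control-plane": {"prio": 0, "backlog": 100},        # small but critical
            "bulk-video":    {"prio": 1, "backlog": 1_000_000},  # huge during congestion
        }

        print("strict:  ", serve(queues, capacity=1_000))
        print("weighted:", serve(queues, capacity=1_000,
                                 weights={"control-plane": 1, "bulk-video": 99}))
        # strict serves all 100 control-plane packets; weighted lets 90 of them drop
        ```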

    3. Adrian 4

      Re: Whatever happened to distributed computing?

      A fail-safe system always fails by failing to fail safe.

      -John Gall, 'Systemantics'

    4. big_D Silver badge

      Re: Whatever happened to distributed computing?

      The other thing with cloud is, by default it isn't redundant and spread over data centers and regions.

By default you get one instance on one server in one region. You actually have to pay extra for the resilience. A lot of companies skimp on this, thinking "cloud" automatically means resilient, until it goes into titsup mode for the first time...

      On the other hand, if you have everything running locally on your own servers, you only have yourself to blame if it goes down.

      1. Anonymous Coward
        Anonymous Coward

        Re: Whatever happened to distributed computing?

        "The other thing with cloud is, by default it isn't redundant and spread over data centers and regions."

While this is true for end users of Google's platforms, Google should have been aware of any redundancy requirements for their own systems.

My understanding is that Google's data centre interconnects are basically their own fibre, in significant quantities, so I'm surprised they suffered congestion. A misconfiguration of a network QoS policy might explain it, particularly given the intention to apply it to one set of systems in one location/region and the issue occurring when it was pushed to other regions.

        1. big_D Silver badge

          Re: Whatever happened to distributed computing?

I was thinking more about the customers who had instances running on the Google Cloud; many don't understand / don't read the small print and think they are resilient, until a region goes down and they realize that they only had an instance in that one region.

          Obviously Google itself knows about this and does add the resilience in for its own products (although this botch put that to the test as well). But this part of the thread was more about the average cloud-using customer.

      2. Claptrap314 Silver badge

        Re: Whatever happened to distributed computing?

        Regional redundancy with Google systems is not the default. It is an absolute requirement for the SRE teams. This was not a problem with redundancy, however.

    5. Anonymous Coward
      Anonymous Coward

      Re: Whatever happened to distributed computing?

      "Whatever happened to distributed computing, so that there is no single point of failure".

      Maybe your attention has been distracted somehow? It turned, very rapidly indeed, into distributed computing that maximizes the owner's short-term profits.

  5. MikeGH

    Interesting choice to prioritise YouTube over business things like email, gcloud etc.

    I would have thought the best way to get bandwidth back would be to cut the bigger usage (videos)

    1. Chris G

      @ Mike GH

      Business was prioritised, the business of advertising to YouTube watchers came way before things like emails.

    2. Anonymous Coward
      Anonymous Coward

      YouTube is extremely distributed. They literally have servers within the network of every ISP containing the 1% most watched cat videos (which are 50% of the traffic) to reduce the latency and bandwidth needs.

    3. Iamnumpty
      WTF?

      Latency sensitive workloads

Email does not need a low-latency network. Email delivery doesn't need bandwidth and latency guarantees. Most people won't even know if an email arrives an hour later, but you surely don't want to watch videos with an hour's worth of gap between two frames.

      1. Allan George Dyer

        Re: Latency sensitive workloads

@Iamnumpty - "Most people won't even know if an email arrives an hour later"

        It depends - is the sender one of those people who phones up after 15 seconds asking "have you read my email?"

        1. eldakka

          Re: Latency sensitive workloads

          Or if it's, say, a 2-factor authentication or a password reset mail you are waiting on so you can login and make some payment before your account/service/registration/parole gets cancelled.

          1. DavidRa

            Re: Latency sensitive workloads

            That would never happen, because everyone is so organised and time-rich thanks to our new automation overlords.

      2. Anonymous Coward
        Anonymous Coward

        Re: Latency sensitive workloads

        Me concurrrrs. I’ve had gMails arriving 1 to 3 days late.

      3. Anonymous Coward
        Anonymous Coward

        Re: videos with an hour worth of gap in between two frames.

        It's like the video equivalent of slow food; probably quite the thing in some circles :-)

    4. aphexbr

      I presume that they did prioritise business - Google's business. If they lose more money by YouTube not using up 80% of bandwidth than they do by not restoring your 0.0001% of bandwidth, then they're at the front of the queue.

    5. Claptrap314 Silver badge

      As a former Google SRE, I've had discussions about what gets priority during major outages. You would be shocked (and probably pleased) to know that $ are not an immediate concern. This was not a major outage, however. The progress of the outage indicates that there was no decision to manually shut down services, or even to put services into emergency operations mode.

  6. Anonymous Coward
    Anonymous Coward

    Management Network

    Is their management network on a separate infrastructure to the front end data? From the article it doesn’t sound like it, if the load made making changes hard.

    1. Iamnumpty

      Re: Management Network

No body implements separate management hardware; it's all done via different VLANs and QoS policies defined for the various networks. As Google stated they prioritised certain workloads over others, it's possible the management network got a very tiny chunk. Aside from that, the specific details haven't been released, so it's anyone's guess what actually went wrong. Did some backbone switches go down, did ports flap, were there network loops or an incorrect STP config, did link aggregation go down, were routes or BGP affected? It could be anything and we will never know.
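      As an aside, the usual defence against that "very tiny chunk" problem is to give the management class a guaranteed floor in the QoS policy. A toy model of the idea (all class names, weights and numbers invented; this is not any vendor's configuration syntax):

      ```python
      # Toy model of per-class bandwidth allocation with a guaranteed floor for
      # management traffic. Illustrative only; not real QoS configuration.

      def allocate(link_gbps, demands, weights, floors):
          """Honour guaranteed minimums first, then share the rest by weight."""
          alloc = {}
          remaining = link_gbps
          for cls, floor in floors.items():            # e.g. keep management alive
              alloc[cls] = min(demands[cls], floor)
              remaining -= alloc[cls]
          total_weight = sum(weights.values())
          for cls, w in weights.items():
              want = demands[cls] - alloc.get(cls, 0)
              alloc[cls] = alloc.get(cls, 0) + min(max(want, 0), remaining * w / total_weight)
          return alloc

      demands = {"management": 1, "video": 400, "other": 100}   # Gbps each class wants
      weights = {"management": 1, "video": 8, "other": 1}
      floors  = {"management": 1}                               # always reserve 1 Gbps

      print(allocate(link_gbps=100, demands=demands, weights=weights, floors=floors))
      # Without the floor, management would be competing for ~1% of a saturated link.
      ```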

      1. vtcodger Silver badge

        Re: Management Network

        and we will never know

Unless, of course, it happens again ... and again ... and again ...

      2. Anonymous Coward
        Anonymous Coward

        Re: Management Network

        "Nobody". Speak for yourself Numpty, there are indeed still networks about implementing management control lans via separate infrastructure. Its niche to care enough about separation and isolation to this level of paranoia, but they do have justification for the business costs that incurs.

      3. Anonymous Coward
        Anonymous Coward

        Re: Management Network

        Uhm. Hi, I'm No body...

        Physically separate management infrastructure is standard in my sector. There is a long established principle that customer data networks are separate from our management networks.

        1. Norman Nescio Silver badge

          Re: Management Network

          I too am Spartacus.

          In my sector, the network control plane is kept rigorously separate from customer data, and also monitoring data (poorly implemented polling can overwhelm networks too). This isn't just separate VLANs, but entirely separate physical infrastructure which connects to the management ports of critical equipment. The control network is built to be very redundant, with liberal use of out-of-band (dial-in) equipment. Keeping things secure is non-trivial when you need to connect to equipment that possibly can't connect to its authentication server...managing pre-shared keys across a large estate of equipment and people with good reasons to need access is gnarly.

          Alphabet/Google choose not to do this, probably for reasons that make sense in Google's business context. In other businesses, the level of network performance plumbed by Google would lead to pointed questions being asked and high-value long-term contracts being put at risk. There is more to networking than the Internet.

      4. Reg Reader 1

        Re: Management Network

        @Iamnumpty

        It appears they need to give their management VLANs higher CoS/QoS. Under the issues they've described they shouldn't have had trouble managing their systems.

Makes me wonder if there wasn't more to this issue, because I find it hard to believe they hadn't done that. It's not as if the GOOG are new to cloud.

      5. Anonymous Coward
        Anonymous Coward

        Re: Management Network

That's not strictly true; many places do have separated management LANs for exactly this reason.

        Depends on your requirements and business cost.

If Google looked at this and said a separate management LAN = millions of dollars, and they only lost thousands, then they made the right decision. If they lost more the first time it failed, then they made the wrong call.

      6. Alister

        Re: Management Network

        @Iamnumpty

        Out-of-band management is a thing, and quite commonly used. However, whether that would realistically scale to the sort of requirements that Google have, I don't know. It doesn't look like it.

    2. Anonymous Coward
      Anonymous Coward

      Re: Management Network

I suspect they will be using software-defined networks (SDN), and as the impact was across regions, keeping a completely separate management network may not be possible, or its absence may at least be a deliberate design decision to allow them to scale up. Rather than relying on individual devices (and the need to reach them if they are down), the design goal is to handle failures quickly.

At the scale Google is running, the management networks may be pushing tens or hundreds of Gbps (e.g. YouTube uploads are reportedly 400 hours of content every minute, which is around 50Gbps at 3-6Mbps to replicate between DCs). Assuming each DC is in the ~15MW range with ~50,000 devices (i.e. the AWS size approximations), even 20Kbps of SNMP/SSH/other management data per device reaches 1Gbps.
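      A quick back-of-envelope check of those figures (the bitrates and device counts are the assumptions above, not published numbers; the straight multiplication comes out a little higher than the ~50Gbps quoted, but in the same tens-to-hundreds-of-Gbps ballpark):

      ```python
      # Back-of-envelope check of the figures above; every input is an assumption.

      hours_uploaded_per_minute = 400            # widely quoted YouTube upload rate
      for mbps in (3, 6):                        # assumed average video bitrate
          realtime_factor = hours_uploaded_per_minute * 60   # 400 h/min = 24,000x real time
          print(f"replication @ {mbps} Mbps ~ {realtime_factor * mbps / 1000:.0f} Gbps")

      devices_per_dc = 50_000                    # assumed AWS-style DC sizing
      mgmt_kbps_per_device = 20                  # SNMP/SSH/etc. per device
      print(f"management traffic per DC ~ {devices_per_dc * mgmt_kbps_per_device / 1e6:.0f} Gbps")
      # -> roughly 72-144 Gbps of replication and ~1 Gbps of management traffic
      ```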

I would assume they are using QoS to prioritise and allocate bandwidth to specific services, and that a policy intended for a single data centre (maybe with new, higher-capacity switches/NICs) was pushed to sites it wasn't intended for. In spite of noticing straight away, they couldn't resolve it and required local assistance (my reading of "additional help") to get things back up. As I am not a Google employee, there are a lot of assumptions in there...

      Hopefully we will get more details as they are always interesting.

      1. Olivier2553

        Re: Management Network

If you are going to implement a separate management network, you will not do your YouTube replication on that network. The management network is purely there so you can remotely connect to the distant machines and issue commands like "reverse that configuration that I just confused up".

  7. 404

    focused engineering sprint

Right there I headdesked... I fucking hate corporate cheerleaders...

    1. Korev Silver badge
      Windows

      Re: focused engineering sprint

      You mean "Let's fix this before we get a P45/Pink slip"?

  8. This post has been deleted by its author

  9. Securitymoose

    GMail was out? Situation normal.

How does this differ from normal, when anything you send to a GMail address is filtered and stripped and delayed, and recorded, before they eventually send it on, if they can be bothered, or pass it to email scammers if they can't? Does anyone still use a GMail account for anything?

  10. Anonymous Coward
    Anonymous Coward

    Protecting the guilty

    How long should I leave it before submitting a Who Me?

  11. Anonymous Coward
    Anonymous Coward

    Hehehe.

Nobody employs separate hardware for their network-manglement.

    Hheheheheheee.

    Redundancy in the cloud.

    HAHAHAHAHAHAHAHAHA.

IT Engineering at its best: the head in a cloud of rainy bits. Some soundbytes too.

    Software redundancy is a very (stress VERY) different thing from complete redundancy. I am the boss of my network, unless my network says different.

    Mechanical engineers have a rather different view on redundancy. (Ask Boeing, lol)

  12. Anonymous Coward
    Anonymous Coward

Using cheap-arsed solutions will always produce cheap-arsed systems.

    Hasn't Google heard of "out of band" management? Don't they have techs in the datacentres? They should have been able to resolve this in seconds, not hours.

  13. 27escape
    Happy

Looking forward to the anonymous Friday write-up of this one.

    Back in 2019, 'Alice' worked for one of the large cloud providers etc...

  14. RyokuMas
    Joke

    Strangely familiar...

    So Google's outages were caused by a configuration error?

    I always said they were trying to be the next Microsoft, but this is taking it a bit too far...

    1. Claptrap314 Silver badge

      Re: Strangely familiar...

      Hate to break it to you, but almost ALL of G's outages are configuration errors. Former Google SRE here. Outages dropped by 80% or so during the configuration freeze at the end of the year. Every year.

      Although, there was that one time that two raptors decided to land in the substation powering our Oklahoma DCs & kiss.....

  15. Pen-y-gors

    Parallelised?

"... and brought on additional help to parallelize restoration efforts."

    or possibly, to paralyse restoration efforts?

  16. Anonymous Coward
    Anonymous Coward

    ""Google’s engineering teams detected the issue within seconds..."

    No shit.

    "Hey, dudes, nothing works!"

    1. PM from Hell
      Holmes

      Re: ""Google’s engineering teams detected the issue within seconds..."

This takes me back to managing a tech team responsible for a large (then) X.25 network; in general our phones would start ringing 10-15 seconds before the management console started to light up if we lost a primary link.

      1. Korev Silver badge

        Re: ""Google’s engineering teams detected the issue within seconds..."

When I did more systems stuff we set up all kinds of alerting systems; not one of them was quicker than a scientist called Clive...

    2. Claptrap314 Silver badge

      Re: ""Google’s engineering teams detected the issue within seconds..."

      Well, given that it took twenty minutes for Azure to figure out they had a global outage, I would say this line was actually a not-so-subtle dig.

  17. JeffyPoooh
    Pint

    Network congestion slowed them

    In some system designs, command and control is on a completely different bus than the bulk traffic.

    E.g. Satellites in orbit absolutely have a dedicated backchannel, using different radio equipment and frequencies, for command and control.

    Here, maybe Google could employ a serial port via a mobile phone connection (just an example) to control its servers when the network is congested.


    1. John H Woods Silver badge

      Re: Network congestion slowed them

      Indeed - it's somewhat surprising to this non-expert that they don't

  18. dnicholas

    Could be worse

    Could be every other Monday *cough* Office365 *cough*

    1. A.P. Veening Silver badge

      Re: Could be worse

      Could be every other Monday *cough* Office345 *cough*

      FTFY

  19. Justicesays

    clearly need a thing like graphics cards settings changes

You have to send a follow-up command to confirm the changes within a time limit, or they revert automatically!

    So if the network is toast, you get it back automatically afterwards (back-out testing permitting); see the sketch below.

    Plus automation setups that don't cross failure domains, ofc.
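    A minimal sketch of that confirm-or-revert idea, in the spirit of the graphics-driver "keep these settings?" dialog (the class, method names and timeout are all hypothetical, not anyone's real mechanism): the new config only sticks if a confirmation arrives over the network in time; if the change has broken connectivity, the confirmation never arrives and the device rolls itself back.

    ```python
    # Hypothetical confirm-or-revert config apply; not any vendor's real mechanism.
    import threading

    class Device:
        def __init__(self, name: str, config: dict):
            self.name = name
            self.config = config
            self._previous = config
            self._pending_revert = None

        def apply_candidate(self, new_config: dict, timeout_s: float = 300.0) -> None:
            """Apply new_config, but schedule an automatic rollback unless confirmed."""
            self._previous = self.config
            self.config = new_config
            self._pending_revert = threading.Timer(timeout_s, self._revert)
            self._pending_revert.start()
            print(f"{self.name}: candidate config applied, awaiting confirmation")

        def confirm(self) -> None:
            """Called by the operator, which only works if they can still reach the device."""
            if self._pending_revert is not None:
                self._pending_revert.cancel()
                self._pending_revert = None
                print(f"{self.name}: config confirmed, keeping it")

        def _revert(self) -> None:
            self.config = self._previous
            self._pending_revert = None
            print(f"{self.name}: no confirmation received, reverted to previous config")

    # If the push wrecks the network, confirm() never gets called and the timer
    # restores the old config on its own.
    dev = Device("edge-router-1", {"qos": "old-policy"})
    dev.apply_candidate({"qos": "new-policy"}, timeout_s=5.0)
    # dev.confirm()   # only reachable while the network still works
    ```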

    1. Claptrap314 Silver badge

      Re: clearly need a thing like graphics cards settings changes

      Former Google SRE here. Automatic reversions of config changes? No. Just no.

      Way too much complexity involved with such a system. And systems that were slow to change over could get whiplash.

I get what you are suggesting, and for a single system it's a lifesaver. But for tens of thousands of systems in a dozen DCs? Nope. Nope. Nope.

      1. Jamie Jones Silver badge
        Happy

        Re: clearly need a thing like graphics cards settings changes

        Are you - perchance - a former Google SRE?

  20. not.known@this.address

    If it ain't broke yet, it soon will be...

    "dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows,"

    Funny, I always thought command-and-control functions were the most latency-sensitive traffic a system could have - but obviously not. Ya learn something new every day (unless you're running services for other people, apparently)

    1. Claptrap314 Silver badge

      Re: If it ain't broke yet, it soon will be...

      Sounds like the fails in their prioritization system that I tried to address while I was there haven't been fixed....

  21. exkay1

Did they read the change implementation note and test it first? Nah, no time for that, let's push it through. Who needs change control? Nothing will go wrong, it's cloud.

  22. Bruno de Florence

Meanwhile, Huawei is LOLing its head off at the Google paper tiger...
