Google Cloud Engine outage caused by 'large backlog of queued mutations'

A 14-hour Google cloud platform outage that we missed in the shadow of last week's G Suite outage was caused by a failure to scale, an internal investigation has shown. The outage, which occurred on 26 March, brought down Google's cloud services in multiple regions, including Dataflow, Big Query, DialogFlow, Kubernetes Engine …

  1. Pascal Monett Silver badge

    "allow emergency configuration changes without requiring restarts."

    And how the heck do you install more memory without powering down the whole thing first?

    It's nonsense to think that the server would be installed with maximum physical memory, then configured not to use it all. If a server needs more memory, you need to physically get the DRAMs to the server and slot them in. How can you possibly add memory without doing that?

    And sure, I get that these are virtualized servers, but the physical box they run on still has to have the memory needed in order to increase the amount allocated to that cache server. I'm guessing we're not talking about 4GB here, but much more than that.

    1. Anonymous Coward
      Anonymous Coward

      Re: "allow emergency configuration changes without requiring restarts."

      I used HP servers 10 years ago that let you hot-swap memory (and CPU!), so not impossible. They may equally be talking about adding additional machines into a cluster without having to restart processes though.

    2. Anonymous Coward
      Anonymous Coward

      Re: "allow emergency configuration changes without requiring restarts."

      Heard of vMotion, or in Google's case, Live Migration?

    3. Anonymous Coward
      Anonymous Coward

      Re: "allow emergency configuration changes without requiring restarts."

      Hot-swapping DIMMs directly is rare, but plenty of servers (Sun/Oracle SPARC ones, for example) allow hot-swap of CPU and memory cards via dynamic reconfiguration. Disable a memory card, pull it out & upgrade it, then plug it back in.

    4. Pascal Monett Silver badge

      Thanks for the responses

      I had no idea that there were motherboards that could support hot-swappable components.

      I knew about hot-swappable HDDs/SSDs, but I thought DRAM was way too delicate for that.

      Thanks for the info.

    5. A Non e-mouse Silver badge

      Re: "allow emergency configuration changes without requiring restarts."

      Who says these are bare-metal setups? It's more likely that these are all some kind of VM, so expanding memory isn't that hard.
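      FWIW, on a run-of-the-mill KVM host that really is a one-liner against libvirt. A minimal sketch, assuming a hypothetical guest called cache-01 with a balloon driver and enough maxMemory headroom defined (the name and sizes are made up):

      ```python
      # Grow a running guest's memory in place via the libvirt Python bindings.
      # Only works up to the domain's configured maximum and needs a balloon driver.
      import libvirt

      conn = libvirt.open("qemu:///system")
      dom = conn.lookupByName("cache-01")      # hypothetical cache VM

      new_size_kib = 32 * 1024 * 1024          # 32 GB, expressed in KiB
      dom.setMemoryFlags(new_size_kib, libvirt.VIR_DOMAIN_AFFECT_LIVE)

      print(dom.info())                        # [state, maxMem, memory, vCPUs, cpuTime]
      conn.close()
      ```

      Whatever hypervisor stack is actually underneath, the principle is the same: the guest's allocation is just a number the host can raise, provided the pool has the physical memory spare.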

    6. donk1

      Re: "allow emergency configuration changes without requiring restarts."

      Your automated deployment could say: deploy 16GB VMs for the cache servers.

      Where they get deployed physically could be anywhere, on hypervisors of any size with spare resources.

      You then say: increase the memory on each VM to 32GB.

      You SHOULD have unused memory in your hypervisor pools to allow for unexpected growth when you operate at the size Google does.

      They have hundreds of thousands, if not millions, of hypervisors, so keep x% free to allow for growth; as you use it, add more hypervisors to the pool!
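      Back of the envelope, that headroom rule looks something like this (every figure here, the VM size, host size and the x%, is invented for illustration):

      ```python
      # Rough capacity-headroom sketch; none of these numbers are Google's.
      import math

      def hosts_needed(vm_count, vm_gb, host_gb, headroom_pct):
          """Hosts required so the pool still keeps headroom_pct of its memory free."""
          usable = 1 - headroom_pct / 100
          return math.ceil(vm_count * vm_gb / usable / host_gb)

      # 1,000 cache VMs at 16 GB on 768 GB hosts, keeping 25% of the pool free...
      print(hosts_needed(1000, 16, 768, 25))   # 28 hosts
      # ...and the emergency flex to 32 GB per VM doubles the requirement.
      print(hosts_needed(1000, 32, 768, 25))   # 56 hosts
      ```

      The point being: the flexing is only "free" if someone has already paid for the idle hosts.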

  2. Anonymous Coward
    Facepalm

    Bad enough

    It's bad enough when my Windows PC stops an update when I run out of memory, but I expected much better from Google. Sounds like fast triumphed over right yet again.

  3. Claptrap314 Silver badge

    You are about to trigger 1,000,000 updates.

    Are you sure?

  4. Sean Hunter

    Postmortem of the incident reveals...

    It all started when the engineer making the change clicked "I'm feeling lucky"

  5. ShortLegs

    Compaq servers could hot-swap SIMMs, CPUs, NICs, and RAID cards at least 15 years ago, possibly longer. It's not new.

  6. donk1

    "Put it in the Cloud it scales and can be flexed up and down dynamically"

    Ha ha ha! How many times do we hear "oh, but that service does not flex... but it will when we fix it" or "we can make those requests much more efficient"? How about writing it properly in the first place? It is all about time to code now; reliability and efficiency are an afterthought.

    "The outage, which occurred on 26 March, brought down Google's cloud services in multiple regions, including Dataflow, Big Query, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, and Cloud Console."

    1 Cloud, 1 set of cache servers, no separation, to be "efficient". It won't all break at once... LOL!!
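    To be fair, the separation being mocked here is not exotic: one cache pool per region means losing one region does not lose the lot. A toy sketch (the CachePool class and the region list are invented for this example):

    ```python
    # Toy illustration: independent cache pools per region, so a backlog of
    # queued mutations in one region doesn't take the others down with it.
    class CachePool:
        def __init__(self, region):
            self.region = region
            self.healthy = True

        def get(self, key):
            if not self.healthy:
                raise RuntimeError(f"cache pool in {self.region} is unavailable")
            return f"value-for-{key}"

    pools = {r: CachePool(r) for r in ("us-east1", "europe-west1", "asia-east1")}
    pools["us-east1"].healthy = False        # simulate the bad region

    for region, pool in pools.items():
        try:
            pool.get("session:42")
            print(region, "OK")
        except RuntimeError:
            print(region, "degraded")
    ```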
