Other people's computers you have no control over.
Google says its four-hour wobble across America and some other parts of the world on Sunday was caused by a bungled reconfiguration of its servers. The multi-hour outage affected Google services including Gmail, YouTube, Drive, and Cloud virtual-machine hosting, and knocked out apps like Uber and Snapchat that rely on the web …
Whatever happened to distributed computing, so that there is no single point of failure? This was an accident; what happens if there is a major Internet failure in one of these regions and the rest try to reroute and choke on it? The whole point of moving to the Cloud was to provide redundancy. If a single config gone rogue can do this then it doesn't say much for the quality of the fundamental architecture. Dynamically moving virtual machines around while they're still running, doing live updates etc. I'm a skeptic, if you can't tell already.
This is distributed computing.
There is always a single point of failure - the administrators.
In this case, one administrator (or set of administrators) sent a configuration to all servers.
This is a failure in their administration/configuration procedures and/or interface usability options. E.g., from a multi-select list of servers/zones/clusters, they selected the "All Zones" option rather than the "cluster XYZ" option.
This sort of thing can be properly ameliorated by better interface design. E.g. not being able to send configuration to all zones at once. Or having to use different accounts for different zones. Sure, there may be an "all zone" account, but then you'd only log in to that account if you were absolutely sending config to all zones. The account they should have been using should not have had the ability to send the configuration to all zones. Or there should be separate administration interfaces for each zone, e.g. to administer zone 1, the admin has to connect to the zone 1 management console/server, which can't send configuration commands to any zone other than zone 1. For zone 2, connect to the zone 2 management service. For All Zones, connect to the All Zone management service, and use a different set of credentials for that (as an "are you sure you really want to update all zones" in a sensible, conscious step rather than just a yes/no dialogue box).
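To make the per-zone credential idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `Credential` and `ManagementConsole` classes, the zone names, the user names); it is not how Google's tooling works, just one way a console could refuse a push that goes beyond the scope of the account in use.

```python
from dataclasses import dataclass, field


class ScopeError(PermissionError):
    """Raised when a credential tries to push config outside its scope."""


@dataclass(frozen=True)
class Credential:
    # Hypothetical model: each credential is bound to the zones it may touch.
    user: str
    zones: frozenset


@dataclass
class ManagementConsole:
    # Hypothetical console; `applied` records the last config pushed per zone.
    all_zones: frozenset
    applied: dict = field(default_factory=dict)

    def push(self, cred, targets, config):
        targets = frozenset(targets)
        unknown = targets - self.all_zones
        if unknown:
            raise ValueError(f"unknown zones: {sorted(unknown)}")
        out_of_scope = targets - cred.zones
        if out_of_scope:
            # Refuse the whole push rather than applying part of it.
            raise ScopeError(
                f"{cred.user} may not configure {sorted(out_of_scope)}"
            )
        for zone in targets:
            self.applied[zone] = config
        return sorted(targets)


ZONES = frozenset({"zone1", "zone2", "zone3"})
zone1_admin = Credential("alice-zone1", frozenset({"zone1"}))
all_zone_admin = Credential("alice-allzones", ZONES)
```

With this shape, selecting "All Zones" while logged in as `zone1_admin` fails loudly and changes nothing; pushing everywhere requires deliberately switching to `all_zone_admin`.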
Indeed, your point is not only valid, but highlights the vital principle that all difficult problems of computing involve human beings. Service and maintenance, human interfaces, estimation, requirements specification... all those nasty areas involve human beings relying on their vastly inferior memories, knowledge, judgment and consistency.
This parallels the wise remark I heard when I first contemplated joining the IT (then known more primitively as "computer") industry back in 1971.
The guru to whom my father introduced me over pints in a suitable hostelry said reflectively, "There can be no doubt that computing will be a very important growth industry for at least the next half century, and probably a lot longer than that. There will be many lucrative opportunities for the shrewd".
I then asked which particular career paths he would recommend, within the computer industry.
"Oh," he replied, "there are several. Management, accounting, sales, personnel..."
I worked as a Google SRE. Sorry, but no.
At any given time, my services were in about 20 DCs. That's not what we could be in, it was what we were actually in. The limit was the toil--the time required just to keep up with reserving space, initiating systems, managing traffic, and the like. From a pure design perspective, we would want to be in about twice as many data centers, as it would allow us to be in a state where we more or less always had one primary DC (and maybe three secondaries) in maintenance. This means that we would minimize excess capacity. But we almost never talked about doing that because it was just too much toil.
The key here is toil--time spent doing things by hand that don't involve long-term improvements of the system. Time spent changing logins is toil--and would be immediately automated out of effect.
Yes, sending out a change to the wrong set of DCs is a BAD THING. And like Sloss said, they are going to spend engineering effort (I would guess close to a man-year) figuring out how to minimize the likelihood of a repeat. Implementation will likely be a few times that.
But usually, such a change would be rolled back in 20-30 minutes, with traffic clearing up in five or less. Speculating slightly, it sounds to me like a big problem is that their traffic prioritization algorithms turned into a primary culprit in the outage. Specifically, their debug and perhaps even configuration change packets were not being prioritized as high priority, and so might have been dropped.
The other thing is that I know that as of a few years ago, the way they were handling traffic prioritization was relatively primitive. I did not have a chance to track down the right people to discuss what they were getting wrong & how to fix it, but one of the results was that prioritization was not absolute.
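For anyone wondering what "absolute" prioritization would look like, here is a toy sketch of strict-priority scheduling in Python. The traffic classes, packet labels and the one-interval byte budget are all made up for illustration; this says nothing about how Google's network actually schedules traffic. The point is only that under strict priority a lower class gets whatever capacity is left after every higher class has been served, so a small config-rollback packet gets through even when bulk traffic is being dropped.

```python
from collections import deque

# Hypothetical traffic classes; a lower number means a higher priority.
MGMT, LATENCY_SENSITIVE, BULK = 0, 1, 2


def strict_priority_schedule(packets, capacity_bytes):
    """Return (sent, dropped) for one scheduling interval.

    packets: iterable of (priority, size_bytes, label) tuples.
    capacity_bytes: how much the congested link can carry this interval.
    """
    queues = {MGMT: deque(), LATENCY_SENSITIVE: deque(), BULK: deque()}
    for pkt in packets:
        queues[pkt[0]].append(pkt)

    sent, dropped = [], []
    remaining = capacity_bytes
    # Serve classes strictly in priority order; each class only sees
    # the capacity left over by the classes above it.
    for prio in sorted(queues):
        for pkt in queues[prio]:
            if pkt[1] <= remaining:
                sent.append(pkt)
                remaining -= pkt[1]
            else:
                dropped.append(pkt)
    return sent, dropped


offered = [
    (BULK, 900, "video-replica"),
    (LATENCY_SENSITIVE, 300, "search"),
    (MGMT, 100, "config-rollback"),
    (BULK, 900, "backup"),
]
sent, dropped = strict_priority_schedule(offered, capacity_bytes=500)
```

In this example the management and latency-sensitive packets are sent and both bulk packets are dropped. A weighted (non-absolute) scheme would instead give every class some share, which is how management traffic can end up starved under heavy congestion.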
The other thing with cloud is, by default it isn't redundant and spread over data centers and regions.
By default you get one instance on one server in one region. You actually have to pay extra for the resilience. A lot of companies skimp on this, thinking "cloud" is cloud, until it goes into titsup mode for the first time...
On the other hand, if you have everything running locally on your own servers, you only have yourself to blame if it goes down.
"The other thing with cloud is, by default it isn't redundant and spread over data centers and regions."
While this is true for end users of Google's platforms, Google should have been aware of any redundancy requirements for their systems.
Given my understanding that Google's data centre interconnects are basically their own significant amounts of fibre, I'm surprised they have suffered congestion. A misconfiguration of a network QoS policy might explain it, particularly given the intention to apply it to one set of systems in one location/region and the issue occurring when it was pushed to other regions.
I was thinking more about the customers who had instances running on the Google Cloud; many don't understand / don't read the small print and think they are resilient, until a region goes down and they realize that they only had an instance in that one region.
Obviously Google itself knows about this and does add the resilience in for its own products (although this botch put that to the test as well). But this part of the thread was more about average cloud using customers.
"Whatever happened to distributed computing, so that there is no single point of failure".
Maybe your attention has been distracted somehow? It turned, very rapidly indeed, into distributed computing that maximizes the owner's short-term profits.
As a former Google SRE, I've had discussions about what gets priority during major outages. You would be shocked (and probably pleased) to know that $ are not an immediate concern. This was not a major outage, however. The progress of the outage indicates that there was no decision to manually shut down services, or even to put services into emergency operations mode.
Nobody implements separate management hardware; it's all done via different VLANs and QoS policies defined for various networks. As Google stated they prioritised certain workloads over others, it's possible the management network got a very tiny chunk. Aside from that, the specific details have not been released, so it's anyone's guess what actually went wrong. Did some backbone switches go down, did ports flap, were there network loops, an incorrect STP config, a failed link aggregation, affected routes or BGP problems? It could be anything and we will never know.
"Nobody". Speak for yourself, Numpty; there are indeed still networks about implementing management control LANs via separate infrastructure. It's niche to care enough about separation and isolation to this level of paranoia, but they do have justification for the business costs that incurs.
I too am Spartacus.
In my sector, the network control plane is kept rigorously separate from customer data, and also monitoring data (poorly implemented polling can overwhelm networks too). This isn't just separate VLANs, but entirely separate physical infrastructure which connects to the management ports of critical equipment. The control network is built to be very redundant, with liberal use of out-of-band (dial-in) equipment. Keeping things secure is non-trivial when you need to connect to equipment that possibly can't connect to its authentication server...managing pre-shared keys across a large estate of equipment and people with good reasons to need access is gnarly.
Alphabet/Google choose not to do this, probably for reasons that make sense in Google's business context. In other businesses, the level of network performance plumbed by Google would lead to pointed questions being asked and high-value long-term contracts being put at risk. There is more to networking than the Internet.
It appears they need to give their management VLANs higher CoS/QoS. Under the issues they've described they shouldn't have had trouble managing their systems.
Makes me wonder if there wasn't more to this issue because I find it hard to believe they hadn't done that. It's not like GOOG are new to cloud.
That's not strictly true; many places do have separated management LANs for exactly this reason.
Depends on your requirements and business cost.
If Google looked at this and said a separate management LAN = $ millions and they only lost $ thousands, then they made the right decision. If they lost more the first time it failed, then they made the wrong call.
I suspect they will be using software defined networks (SDN) and as the impact was across regions, keeping a completely separate management network may not be possible or at least a deliberate design decision to allow them to scale up. Rather than relying on individual devices (and the need to reach them if they are down), the design goal is to quickly handle failures.
At the scale Google are running, the management networks may be pushing tens or hundreds of Gbps (e.g. YouTube uploads are reportedly 400 hours of content every minute, which is around 50Gbps @3-6Mbps to replicate between DCs). Assuming each DC is in the ~15MW range with ~50,000 devices (i.e. the AWS size approximations), even 20Kbps per device of SNMP/SSH/other management data will reach 1Gbps.
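The back-of-envelope numbers above can be reproduced with a few lines of Python. All inputs are the rough figures quoted in this comment (400 hours uploaded per minute, 3-6Mbps per stream, ~50,000 devices, 20Kbps each), not measured values, and the function names are just for illustration. Worked through this way, 3-6Mbps gives roughly 72-144Gbps for a single replication copy, and the quoted ~50Gbps corresponds to about 2Mbps per stream; either way it is the same order of magnitude.

```python
# Inputs are rough figures from the comment above, not measurements.
UPLOAD_HOURS_PER_MINUTE = 400
DEVICES_PER_DC = 50_000
MGMT_KBPS_PER_DEVICE = 20


def replication_gbps(bitrate_mbps, hours_per_minute=UPLOAD_HOURS_PER_MINUTE):
    """Bandwidth needed to copy the incoming video once, in Gbps."""
    # 400 hours/minute = 400 * 3600 / 60 seconds of video arriving per second.
    video_seconds_per_second = hours_per_minute * 3600 / 60
    return video_seconds_per_second * bitrate_mbps / 1000


def management_gbps(devices=DEVICES_PER_DC,
                    kbps_per_device=MGMT_KBPS_PER_DEVICE):
    """Aggregate management traffic for one data centre, in Gbps."""
    return devices * kbps_per_device / 1_000_000


low, high = replication_gbps(3), replication_gbps(6)   # 72.0 and 144.0
mgmt = management_gbps()                                # 1.0
```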
I would assume they are using QoS to prioritise and allocate bandwidth to specific services, and a policy intended for a single data centre (maybe with new, higher capacity switches/NICs) was pushed to sites it wasn't intended for. In spite of noticing straight away, they couldn't resolve it and required local assistance (my reading of "additional help") to get things back up. As I am not a Google employee, there are a lot of assumptions in there.
Hopefully we will get more details as they are always interesting.
If you are to implement a separate management network, you will not do your youtube replication on that network. Management network is purely to be able to remote connect to the distant machines and issue commands like "reverse that configuration that I just confused up".
How does this differ from normal, when anything you send to a GMail address is filtered and stripped and delayed, and recorded, before they eventually sent it on, if they can be bothered, or passed to email scammers if they can't? Does anyone still use a GMail account for anything?
Nobody employs separate hardware for their network-manglement.
Redundancy in the cloud.
IT Engineering at its best, the head in a cloud of rainy bits. Some soundbytes too.
Software redundancy is a real (stress REAL) different thing than complete redundancy. I am the boss of my network, unless my network says different.
Mechanical engineers have some different view on redundancy. (Ask Boeing, lol)
Hate to break it to you, but almost ALL of G's outages are configuration errors. Former Google SRE here. Outages dropped by 80% or so during the configuration freeze at the end of the year. Every year.
Although, there was that one time that two raptors decided to land in the substation powering our Oklahoma DCs & kiss.....
This takes me back to managing a tech team responsible for a large (then) X.25 network; in general our phones would start ringing 10-15 seconds before the management console started to light up if we lost a primary link.
In some system designs, command and control is on a completely different bus than the bulk traffic.
E.g. Satellites in orbit absolutely have a dedicated backchannel, using different radio equipment and frequencies, for command and control.
Here, maybe Google could employ a serial port via a mobile phone connection (just an example) to control its servers when the network is congested.
You have to send a follow up command to confirm the changes within a timelimit, or they revert automatically!
So if the network is toast, then you get it back automatically after (back out testing permitting).
Plus automation setups that don't cross failure domains, ofc.
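The confirm-or-revert idea above can be sketched for a single device in Python. It is similar in spirit to the `commit confirmed` command on Juniper's Junos, but the `ConfirmedConfig` class, its method names and the config labels below are hypothetical, and this is a toy for one box, not a design for a fleet. The clock is injectable purely so the timeout can be exercised without waiting.

```python
import time


class ConfirmedConfig:
    """Holds a running config; an unconfirmed change reverts after a deadline."""

    def __init__(self, initial, clock=time.monotonic):
        self.active = initial
        self._previous = None
        self._deadline = None
        self._clock = clock  # injectable so the timeout can be tested

    def apply(self, new_config, confirm_within_s):
        self.tick()  # settle any earlier change whose deadline has passed
        if self._deadline is not None:
            raise RuntimeError("previous change still awaiting confirmation")
        self._previous = self.active
        self.active = new_config
        self._deadline = self._clock() + confirm_within_s

    def confirm(self):
        """Return True if a pending change was confirmed in time."""
        self.tick()
        if self._deadline is None:
            return False  # nothing pending (possibly already reverted)
        self._previous = None
        self._deadline = None
        return True

    def tick(self):
        """Call periodically on the device; reverts once the deadline passes."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self.active = self._previous
            self._previous = None
            self._deadline = None
```

The important property is that the revert runs locally on the device: if the change cuts off the operator's access, no confirmation can arrive and the old config comes back when `tick()` next runs after the deadline.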
Former Google SRE here. Automatic reversions of config changes? No. Just no.
Way too much complexity involved with such a system. And systems that were slow to change over could get whiplash.
I get what you are suggesting, and for a single system it's a lifesaver. But for tens of thousands of systems in a dozen DCs? Nope. Nope. Nope.
"dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows,"
Funny, I always thought command-and-control functions were the most latency-sensitive traffic a system could have - but obviously not. Ya learn something new every day (unless you're running services for other people, apparently)