Clouds are great!
Except when they go down. Which seems to be happening with great regularity ... when was the last time ElReg went a week without a major cloud failure?
Google has explained how it took a big slab of its Euro-cloud offline last week, and as usual the problem was of its own making. The 9 December incident was brief and contained – it started at 18:31 PT (02:31 UTC), lasted 84 minutes and only impacted the europe-west2-a zone – but meant 60 per cent of VMs in the zone were …
why is this a cloud issue specifically? this is a problem with some technology that lots of people happen to depend on - but really, 84 minutes of lost youtube-ing... don't get me wrong, in an ideal world this wouldn't happen, but then neither would covid, or stillborn babies, or asteroids falling from the sky.
it always amazes me that people get so upset when things don't work for a short while, when really those people should be amazed all this shit works in the first place - the complexity of modern technology solutions is staggering, how many experienced designers, engineers, architects in an FMEA workshop would ponder the question "what happens to the BGP when the ACL goes awry"? the article even raises the point that "by design" the service behaved in a certain manner, suggesting that indeed people had considered the impact of "great clouds" not functioning quite perfectly for a moment, just that being humans, weren't able to walk the decision tree from every single element, to every single element, via every single element.
they will learn from this, and the "great cloud" will become a tiny bit more robust as a result, meaning the next time, we'll probably only lose our youtubes for 83 minutes, a net improvement of more than 1% by my calculator...
And the software couldn't handle being unable to access a critical file by spewing out a critical error on some monitoring system? Maybe because Go doesn't have exceptions? <G>
And that's not YouTube - if it's your systems in the cloud that stop working, you may have a different opinion about an hour-and-a-half outage.
Office 365 was knocked out for 4+ hours for a lot of the UK on Friday and another 2 hours for the UK and some of Europe on Monday. Almost every single day there is temporary degradation of functionality somewhere on their system. Cloudflare, Amazon and Google clouds are no better, as they constantly have unreported transient failures which disappear within seconds.
I think it is long past time to admit that, when you're a business with a predictable workload, having dedicated meatsacks monitoring systems in-house is far cheaper, far more reliable and overall better supported than outsourcing to large corporations.
You want to know what the largest expense for Google is? (Hint: it's biological, and not food)
It takes quite a few meat sacks to support a four nines environment. If you want more than that, it takes a LOT of them.
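For a sense of scale, here is a rough sketch of what "N nines" allows in downtime per year. It assumes a 365-day year and treats availability as simple uptime over total time; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed at an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


# Print the yearly budget for two to five nines.
for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):8.2f} min/year")
```

By this arithmetic, the 84 minutes reported for europe-west2-a is already more than a whole year's four-nines budget for anything that lived only in that zone.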
It is interesting to me, however, that the failures seem to focus around config distribution and quota management. Someone ought to look into that...
>the complexity of modern technology solutions is staggering,
1) You seem to think that's an inherently good thing. It's not.
>it always amazes me that people get so upset
2) People get upset when their heart monitor fails, electric car stalls, front door remains stubbornly locked, child's bedroom light won't switch on, etc. It's not the fault of those people. It's the complete insanity of designing every day products which require a live connection for even their most basic functionality. People get upset because they can't believe any product designer could be that much of a moron.
There is a choice not to become dependent on this complication.
Simple example: A few weeks ago I rang up to order some material from a vendor in much the same line of business as an old client of mine in engineering supplies. My client had his stock control/sales order processing in house. This guy used some external supplier. The necessary complication got in the way of him entering the order on his system, so he had to ring me back when he regained access. That's a couple of decades of "progress".
Stick with Einstein's dictum: everything should be as complicated as necessary but not more.
how would you relate BGP to an ACL?
Very, very carefully. But it's normal for BGP. So there's some examples here for IPv6 filters-
Mainly because v6 can be.. less clear than conventional v4 filtering, masking and inverse masking, so errors can be more common. But basically BGP's.. pretty dumb. It exchanges routes between peers. It might exchange routes you'd rather it didn't, eg bogons (RFC1918 reserved addresses), internal addresses, other peers etc etc. Then BAD THINGS happen.. Like AS7007*. I still remember that given it happened on (I think) the 2nd week of me being a large ISP's router wrangler.
So controlling what BGP advertises (or accepts) from other peers gets done by various bits of BGP-fu, like prefix lists, route filters, ACLs. Then it's generally best practice (ie a LOT less f'ng dangerous/embarrassing) to create/modify/sanity check those offline than just going for it on a core router's CLI. Especially given those routers often contain many ACLs. And the BGP peer will also have its own ACLs, so if it sees stuff that it doesn't like, the peering session may get dropped.
Then more BAD THINGS happen, like if you need to get onto the router at the other end of the peering session, well, it's time to check your out-of-band connection, because if the routing session is down, that network is unreachable via BGP.. Oops.
But that's all part of the joy of networking, especially on large/complex networks. And also leads to design choices, like whether to use BGP, which is a pretty dumb EGP (Exterior Gateway Protocol) as an IGP (Interior Gateway Protocol) for internal routing, when something like IS-IS might be a better choice.
* See- https://en.wikipedia.org/wiki/AS_7007_incident
Is the Internet down? Nope, it's just currently on vacation in Florida..
I would like to be surprised at Google taking eight hours to realise their systems weren't, in fact, up. But I once had to deal with a major global telecoms provider which went for five weeks insisting that they didn't have a problem. Eventually got hold of their CEO who was less than impressed that their systems had been limping for more than a month with nobody noticing. One of their VPs got a proper chewing, but when they eventually found out (and fixed) the problem I fully understood how easily it can happen. Their real mistake was in being complacent about the measures that they had in place.
>One of their VPs got a proper chewing, but when they eventually found out (and fixed) the problem I fully understood how easily it can happen. Their real mistake was in being complacent about the measures that they had in place.
Yup, that happens. All too often in places that outsource their helldesks. So I've seen that result in a situation where there's conflicting priorities, ie the outsourcer is incentivised to close tickets fast, the telco's surviving ops people, to resolve customer problems. Or in organisations that are customer focused, and put the customer first.. Yet customer facing roles like sales don't look at their customer's TTs. There's also simple tricks like alerting if a customer raises >X TTs in a month/week, which can catch intermittent faults, or ops types prematurely closing TTs.
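The ">X TTs in a month/week" trick is simple enough to sketch. This is only an illustration: the ticket format (customer id plus opened timestamp), the function name and the default threshold and window are all assumptions, not any particular ticketing system's API.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def customers_over_threshold(tickets, now, window=timedelta(days=30), max_tickets=3):
    """Return customers who raised more than max_tickets within the window.

    tickets: iterable of (customer_id, opened_at) pairs (hypothetical format).
    """
    counts = defaultdict(int)
    for customer, opened_at in tickets:
        # Only count tickets opened inside the look-back window.
        if now - window <= opened_at <= now:
            counts[customer] += 1
    return sorted(c for c, n in counts.items() if n > max_tickets)
```

Because it counts opened tickets rather than open ones, a customer whose tickets keep getting closed prematurely still shows up.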
But you can't be a customer focused service provider unless you have the tools and processes to catch problems. Preferably with event correlation, notification, automatic escalation etc. There's plenty of software to do that, one of which has been in the news recently.. But technology can also play a part, eg the move towards Ethernet services, with some customers still expecting IP-like monitoring and reporting. Or occasionally Dark Fibre customers wanting the same.. And intermittent DF faults are tricky to resolve given there's no in-band (or out-band) access to the service layer for the provider.
This outage seems to have demonstrated quite a few problems for Google. So lack of risk management & change control let critical services like their 'leader' box get altered and not monitored properly. Also curious about this bit-
“Once zone europe-west2-a reconnected to the network, a combination of bugs in the VPN control plane were triggered by some of the now stale VPN gateways in the zone,”
And whether that was a bug or a feature. Google's known for rolling their own network kit and software, so if that struggled with a major outage and having to teardown and re-establish a lot of VPNs, which can be rather computationally intensive.. But that's also the joy of SDNs. Being 'dynamic' or reconfigurable has benefits, but also can be a lot harder to monitor & manage than relatively fixed topologies.
i don't believe i offered an opinion on the complexity being good or bad, but having worked in less complex enterprise environments than Internet scale cloud services there is no choice but to make it complicated?
how would you relate BGP to an ACL?
In this case, it kinda has to be otherwise you end up with other bigger issues. BGP uses ACLs to describe what routes it'll accept and what it'll advertise. If you don't do that, you get that other major internet issue where someone advertises 0.0.0.0/0, and even worse, some ISP accepts it. I can't remember when that last happened, but it was really big and in the last 10 years. Even without big outages, allowing anyone to advertise a route is a bad idea since they could steal or monitor traffic.
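As a toy illustration of that kind of inbound filter, here is a sketch using Python's standard `ipaddress` module. It is not router configuration and not how any real BGP implementation is written; the deny list (RFC 1918 ranges only), the /24 length limit and the function name are assumptions chosen for the example, and it only handles IPv4.

```python
import ipaddress

# Illustrative deny list: the RFC 1918 private ranges.
DENIED = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
DEFAULT_ROUTE = ipaddress.ip_network("0.0.0.0/0")
MAX_PREFIX_LEN = 24  # assumed policy: reject anything more specific than /24


def accept_route(prefix: str) -> bool:
    """Decide whether an advertised IPv4 prefix passes the filter."""
    net = ipaddress.ip_network(prefix, strict=True)
    if net.version != 4:
        return False  # this sketch covers IPv4 only
    if net == DEFAULT_ROUTE:
        return False  # never accept a default route from a peer
    if net.prefixlen > MAX_PREFIX_LEN:
        return False  # too specific
    # Reject anything that falls inside a denied range.
    return not any(net.subnet_of(denied) for denied in DENIED)
```

For example, `accept_route("8.8.8.0/24")` passes, while `"0.0.0.0/0"` and `"10.1.0.0/16"` are refused.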
This boils down to the "real" hard part, which is maintaining good integration and test environments that match up with production. If you do everything with automation, it's possible. But the people need to have the discipline to never edit prod or at least be very careful to edit all the environments the same way.
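Google's write-up says production held ACLs that staging and canary did not, so removing them was never really tested. A pre-flight check for that kind of drift could look roughly like this; the ACL names and the function are hypothetical, and real ACLs would not be plain strings.

```python
def untested_removals(removed, staging_acls, prod_acls):
    """ACL entries a change removes that exist in prod but not in staging.

    Any entry returned here means the staging validation never exercised
    that removal, so a green result in staging says nothing about prod.
    """
    return sorted((set(removed) & set(prod_acls)) - set(staging_acls))


# Hypothetical example data.
prod = {"netctl-read", "netctl-write", "legacy-elect"}
staging = {"netctl-read", "netctl-write"}
change = {"legacy-elect", "netctl-write"}

print(untested_removals(change, staging, prod))  # prints ['legacy-elect']
```

An empty list is the only result that makes a staging pass meaningful for that change.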
well while you're blushing and giggling at modern tech, your medical data could have been on that cloud service, and the downtime was while you were having open heart surgery, and the displays in the operating theatre go "error - cannot reach server" while they need to check your patient record as to whether you have a history of arrhythmia, you know, to like save your life. 84 minutes is a LOONG time there.
The point is the cloud hype is making people gloss over what happens when it goes wrong.
You'd have a point if it were only youtube putting all their chips in the cloudy basket.
So you don't have a point.
Running a life-critical system on a cloud system, with no local backup, in one region only, would be negligent to the point there would likely be penalties for the hospital. The same is generally true of any other system where downtime is potentially harmful. It's system design.
You have to keep in mind the ways that exist for making a problem like this less likely. If you run your systems in house but you run the servers from one computer room, what do you do if the UPS in it fails and kills the power? If it will take you long enough to recover from that that you can't withstand the harm caused, then you need a redundant UPS. And possibly you'll need two computer rooms for redundancy so a flood in one doesn't take out the other. Or maybe you even need multi-building redundancy. It all depends on how long you can withstand a failure and how much you're willing to spend to make that failure less likely.
The same is the case for cloud deployments. There's a reason that every cloud has different levels of redundancy, because they have problems. In this case, only one zone was affected, so having redundancy across zones or regions would have prevented it. A sufficiently-interested user should have set that up, just as a sufficiently-interested admin should have done for systems running locally. If you're worried enough about a global outage for a cloud provider, then you would either need two cloud providers or to run the systems yourself, but if you're worried enough about a global cloud outage, your systems have to be really well-administered and redundantly set up to make the risk level the same.
@anonymous (yeah, I know) What you are very effectively pointing out is that a lot of people (including you) don't understand the first thing about making cloud deployments reliable. As @doublelayer & @xamol are both trying to explain to you, it is quite easy to make cloud deployments far more reliable than anything on-prem.
The main issue is that people like you keep thinking that on-prem is the ultimate in reliability when it clearly is not when compared to a correctly architected cloud deployment. At least not for anything but the very largest organizations (cf DCs in > 2 geos + competent staff to maintain 24/7/365 + significant opex budgets)
The fact is that cloud failures are public events with subsequent investigations of root causes, hence data availability for reliability is (relatively) public. With on-prem, you and I have no idea and there is no accountability.
There was no mention of on-prem.
Ultimately anything could be "designed" right - ignoring cost. This goes to the comment earlier about SLA and redundancy.
What do you do when your provider breaches their SLA commitment? Sure you can sue, but until then? Yes it is not in the vendor's interest, but it wasn't here either.
At the end of the day, you can walk up to the machine with on-prem.
The right answer is probably hybrid. But that is not where the industry hype is at. It is largely an opex/ capex driven story, rather than rightly picking the benefits of cloud.
You also have insufficient standardization to move between vendors - so vendor lock in as well.
So no your point that cloud is perfect isn't valid either. There are benefits to shared infrastructure, and there are disadvantages that cloudy folk like you gloss over.
Great, now with your hybrid, no one can be accountable and you have twice the expense with 4x the complexity. That's a wonderful solution, if you work for Accenture....
Having built a couple of DCs for F500s in a previous life, I'd much rather use one or more cloud providers. Easier to deploy, more secure, way more reliable, easier to expand & update and far cheaper.
But hey, whatever. I don't have to live with what you build, thankfully....
I ran ops for $23b of projects and was the CTO of the Linux Foundation. I can safely say that unless you are one of the tech giants, your argument for security is an illusion. The big cloud vendors have 10x more people working on security than every F500 company put together and they are fighting off nation-state resource-level attacks (and mostly winning) every minute.
And if you think that rolling your own systems isn't complex, you are fooling yourself. Just look at the number of packages on your Linux distro and show me a code trace of all the contributors. And that is just _one_ system with _one_ piece of software..... And that's assuming anyone on your staff is even capable of understanding how things work.
If you're running systems in the cloud and 84 mins of outage is too much for you to accept then you need to deploy with a more resilient architecture using more than one availability zone and more than one cloud region. This isn't a new concept and anyone deploying a HA system to a single cloud region only has themselves to blame.
For anyone else who doesn't need 5 nines, then they accept the risk of outages and claim against the SLAs they signed with the cloud provider.
You're making two points - design to the SLA and what happens when the committed SLA is breached.
Even if you designed redundancy, you depend on the committed SLA. If the availability here went below the contractual SLA, your service can suffer.
Sometimes such damages are not "claimable" in a restorative way, especially as we move to and become reliant on such cloud services, like utilities, and their applications expand without thought.
The point I'm trying to make is that you should plan for outages. If your system is critical, then you have to plan accordingly and have redundancy. Whether that's active/active or active/passive you should be trying to remove SPOFs so your system should be in more than one cloud region. If your system isn't critical then you have to have a different BCP that suits your requirements. Either way, you shouldn't expect zero downtime from a cloud region just like you shouldn't expect zero downtime from a traditional DC.
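The back-of-envelope version of why removing SPOFs helps: if each site is up some fraction of the time and failures are independent, the service is down only when every site is down at once. That independence is a big assumption (a bad config pushed everywhere breaks it), and the function name is made up for illustration.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Availability of a service that is up if at least one site is up.

    Assumes site failures are statistically independent, which correlated
    events (shared config, shared provider) will violate.
    """
    return 1 - (1 - per_site) ** sites


# Availability with one, two and three sites at 99% each.
for n in (1, 2, 3):
    print(f"{n} site(s) at 99%: {combined_availability(0.99, n):.6f}")
```

Under those assumptions, two sites at two nines each get you to roughly four nines, which is why the second region matters far more than polishing the first.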
If you take it to the extreme, you should probably be planning to have cross provider redundancy e.g. deploy into Azure and AWS, but that's not going to come for nothing in terms of solution design and operational planning. You'd want to have a seriously critical system to warrant that effort.
This - it is VERY easy to build redundancy in cloud implementations. I've done it in 4 companies and several startups. The cost isn't particularly high unless you are doing active-active, easily less than $4k/month and as low as a few hundred.
Really, there is no excuse not to do this other than either laziness or just being clueless about how cloud infrastructure works.
Fundamentally, despite what all the haters believe (and it is only belief), it's far easier to build a secure, scalable & redundant system in the cloud than it is on-premise.
This is a dumb answer - it's bad now, so no harm making it worse.
The internet depends far too much on the US. All it takes is for some political event there to break things.
You’ve factored that in your cloud deployment? No.
There are other models - for eg serviced models, that can bring benefits of technology reuse, expertise and template applications.
It does not need to be the present model, which is centralised and with a much larger attack surface.
Isn’t this outsourcing everything. It’s easier for sure.
A good design is the simplest design that gets the job done, not the easiest. That has and will always be true.
"Google’s underlying networking control plane consists of multiple distributed components that make up the Software Defined Networking (SDN) stack. These components run on multiple machines so that failure of a machine or even multiple machines does not impact network capacity. To achieve this, the control plane elects a leader from a pool of machines to provide configuration to the various infrastructure components. The leader election process depends on a local instance of Google’s internal lock service to read various configurations and files for determining the leader. The control plane is responsible for Border Gateway Protocol (BGP) peering sessions between physical routers connecting a cloud zone to the Google backbone."
See, we have this 'system stuff' which is incredibly reliable. But it's terribly complex. It turns out we don't really understand its full dynamic failure modes ourselves, but we don't admit that ;-)
"Google’s internal lock service provides Access Control List (ACLs) mechanisms to control reading and writing of various files stored in the service. A change to the ACLs used by the network control plane caused the tasks responsible for leader election to no longer have access to the files required for the process."
Someone changed some 'system stuff' and for some reason, it all fucked up :-O
"The production environment contained ACLs not present in the staging or canary environments due to those environments being rebuilt using updated processes during previous maintenance events. This meant that some of the ACLs removed in the change were in use in europe-west2-a, and the validation of the configuration change in testing and canary environments did not surface the issue."
Our 'system stuff' is so reliable that we don't really need to validate changes properly before rollout. So we didn't. We just validated any old configuration :-~
"Google's resilience strategy relies on the principle of defense in depth. Specifically, despite the network control infrastructure being designed to be highly resilient, the network is designed to 'fail static' and run for a period of time without the control plane being present as an additional line of defense against failure."
Our system stuff is incredibly reliable, so reliable that it'll kind of 'appear' to run normally, even when completely knackered! Isn't that just great? :-}
"The network ran normally for a short period - several minutes - after the control plane had been unable to elect a leader task. After this period, BGP routing between europe-west2-a and the rest of the Google backbone network was withdrawn, resulting in isolation of the zone and inaccessibility of resources in the zone."
Our completely knackered system stuff ran for several minutes. I know! Amazing! Unfortunately, during that time nobody actually managed to spot its complete knackerement because, well, why would they? They weren't even looking- our system stuff is incredibly reliable :-)
Very soon our system stuff fell over completely causing visible errors, which we weren't expecting AT ALL.
So why did our system, taken as a whole, fail to be resilient? Well, it's 'system stuff' and it's terribly complex. So. Hmmm... we don't... really... know... :-(
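The "fail static" behaviour in the quoted write-up can be modelled as a small state machine: routes stay up for a grace period after the leader is lost, then get withdrawn. This is only a sketch of the idea as described; the class, its method names and the grace period value are assumptions (the write-up only says "several minutes").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ZoneRouting:
    grace_seconds: float                    # hypothetical value; not given in the write-up
    leader_lost_at: Optional[float] = None  # time the control plane lost its leader

    def on_leader_lost(self, now: float) -> None:
        # Record only the first loss; repeated failures don't reset the clock.
        if self.leader_lost_at is None:
            self.leader_lost_at = now

    def on_leader_elected(self) -> None:
        self.leader_lost_at = None

    def routes_advertised(self, now: float) -> bool:
        """True while healthy, or while still inside the fail-static window."""
        if self.leader_lost_at is None:
            return True
        return now - self.leader_lost_at < self.grace_seconds

    def needs_alert(self) -> bool:
        """The network looks fine during the window, so alert on leader loss itself."""
        return self.leader_lost_at is not None
```

The point of `needs_alert` is that traffic still flowing is not evidence of health; the window is only useful if someone is paged at the start of it.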
Biting the hand that feeds IT © 1998–2021