Is Control Theory still on the syllabus for ICT qualifications?
The nature of control systems has changed a lot over the years but I dare say that the principles are as relevant today as they were back when I were a student.
Netflix, Tinder, Airbnb and other big names were crippled or thrown offline for millions of people when Amazon suffered what's now revealed to be a cascade of cock-ups. On Sunday, Amazon Web Services (AWS), which powers a good chunk of the internet, broke down and cut off websites from people eager to stream TV, or hookup with …
Anyone can get the basics really quickly, but I daresay that in systems like these it is very unclear how to analyse things in terms of feedback loops. This is worse than an electric power circuit. Even finding the relevant loops seems dicey, as they may change from day to day. Two possible options: overprovision, or build in features to kill services off in order to regain control and sod the quality of service.
Clearly, in cloud land - no.
There are multiple proofs of it - if you look at what they are trying to do with networking, it is in the same league too. The moment I look at some of their "software driven cloud network designs", it is obvious that there is no convergence, no global minimum, and it is subject to at least one positive feedback loop with a potential tailspin death of the service. However, explaining this to the cloud guy is like throwing pearls to swine. He does not get it.
it is obvious that there is no convergence, no global minimum, and it is subject to at least one positive feedback loop with a potential tailspin death of the service
I like overheated rhetoric as much as anyone, but most data-processing systems are not amenable to cybernetic analysis. Even in simple analog systems, control engineers have to go through major contortions to get something that behaves predictably.
The mathematics of working out exactly how many straws of what length may distress the camel's spine is notoriously difficult. Especially when the camel is changing weight.
Seriously, in terms of feedback theory the loop is both non-linear (with thresholds) and subject to hysteresis.
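If anyone wants to see the tailspin for themselves, here is a throwaway toy model in Python - all the numbers and the overload penalty are invented, and it has nothing to do with AWS's actual internals. The threshold is the point where offered load exceeds capacity; the hysteresis is that once the retry backlog starts feeding on itself, the system does not recover even after the demand spike has gone away:

    # Toy model of a retry-driven positive feedback loop (illustrative only).
    CAPACITY = 100          # requests the service can usefully complete per tick
    NORMAL_LOAD = 80        # steady-state new requests per tick (stable on its own)
    SPIKE_LOAD = 150        # new requests per tick during a short spike
    SPIKE_TICKS = range(10, 15)

    backlog = 0             # failed requests waiting to be retried next tick
    for tick in range(40):
        new = SPIKE_LOAD if tick in SPIKE_TICKS else NORMAL_LOAD
        offered = new + backlog
        if offered <= CAPACITY:
            goodput = offered                         # below the threshold: all is well
        else:
            goodput = CAPACITY * CAPACITY // offered  # overload: work is wasted, throughput collapses
        backlog = offered - goodput                   # every failure comes straight back as a retry
        print(f"tick {tick:2d}  new={new:3d}  offered={offered:5d}  goodput={goodput:3d}  backlog={backlog:5d}")

Run it and watch the backlog keep climbing long after the spike ends - that is the positive feedback loop in miniature.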
This sort of issue is notoriously difficult to avoid, even for professionals - I wouldn't call the nice chaps who devised the TFTP protocol amateurs, yet they managed to bake exactly such a problem (see "Sorcerer's Apprentice Syndrome") right into the specification - it only got corrected years later (unfortunately, the correction pretty much breaks TFTP with u-boot to this day if there is any packet loss).
But that's fine because you've built the service that provides revenue for your business to scale across multiple providers, or at least failover to another? Or at least using multiple AZs within Amazon? Right?
The only fail here is people relying on the availability of a single site. The same fail everyone has been talking about for 30+ blinking years.
No, because the cloud prevents you from needing all that. All the resilience is in the super-massive-cloud provider's infrastructure and software, thus negating your puny little business needing to try (and fail ofc) to do all that stuff.
Except it doesn't, does it?
I will grant you that the marketing material spaffed everywhere leads to that assumption, but if you actually read the best practice guides etc. there is mention of multiple AZs and then multiple regions if you want to "stay up no matter what". The reality is people want to see the "cost reduction" of the cloud - amid the incessant moaning and bitching of cheap, tight-fisted, tossing finance dicks - yet the truth is that in the cloud multiple regions ain't cheap. It's still the same caveat: "you get what you pay for".
"Staying up" is arguably the tip of the iceberg.
Another difficult problem surely occurs where it is essential that data can be input/updated at any node, and where data-locking prevents data corruption.
If any segment of the cloud goes down then any nodes in that sector would surely set all nodes on both sides of the break to be "read-only" until replication can be guaranteed.
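For what it's worth, that is roughly how quorum-based replication handles it: a write only counts if a strict majority of replicas acknowledge it, so the minority side of a partition effectively drops to read-only rather than risking divergence. A minimal sketch in Python - the Replica class and the zone names are invented for illustration, not taken from any particular product:

    class Replica:
        def __init__(self, name, reachable=True):
            self.name = name
            self.reachable = reachable
            self.data = {}

        def write(self, key, value):
            if not self.reachable:
                raise ConnectionError(f"{self.name} unreachable")
            self.data[key] = value

    def quorum_write(replicas, key, value):
        """Accept the write only if a strict majority of replicas acknowledge it."""
        acks = 0
        for r in replicas:
            try:
                r.write(key, value)
                acks += 1
            except ConnectionError:
                pass
        if acks <= len(replicas) // 2:
            raise RuntimeError("no quorum - refusing writes until the partition heals")
        return acks

    replicas = [Replica("az-a"), Replica("az-b"), Replica("az-c", reachable=False)]
    print(quorum_write(replicas, "user:42", "hello"))   # 2 of 3 replicas ack -> write accepted

    replicas[1].reachable = False                       # a second zone drops off the network
    try:
        quorum_write(replicas, "user:42", "world")      # only 1 of 3 acks -> rejected
    except RuntimeError as err:
        print(err)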
As people have said earlier, DynamoDB is not IaaS but a managed service which claims to have resilience built in.
From the AWS FAQ:
____
Q: How highly available is Amazon DynamoDB?
The service runs across Amazon’s proven, high-availability data centers. The service replicates data across three facilities in an AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage.
Q: How does Amazon DynamoDB achieve high uptime and durability?
To achieve high uptime and durability, Amazon DynamoDB synchronously replicates data across three facilities within an AWS Region
____
Now, I grant you that multiple region replication is possible but given that this problem wasn't caused by 3 availability zones going down but rather by a capacity error within the database infrastructure it's unlikely that additional resilience would have solved anything. This problem falls very much into the 'managed service failure' category and so the fault sits with AWS.
The lesson here could be that people should stick to IaaS from Amazon and plan their own service management (especially at the scale of Netflix and Airbnb), but it certainly isn't that Amazon aren't at fault because people ignored their best practice advice, as a number of comments have suggested.
That is what you actually pay for when buying cloud. You pay for the assumption that you have paid for the mainframe of the 21st century - an infrastructure that does not fail. It walks like a mainframe, it talks like a mainframe, it is a mainframe, and the IBM CEO of old (Watson) who said "I think there is a world market for about five computers" is having the last belly laugh.
If you are trying to failsafe and failover cloud you have failed to grok what cloud is in the first place.
Now, the fact that the present generation of Mainf^WClouds is pretty far from being failproof by design is an entirely different story.
You've never worked in telephony have you?
Yup the land of 5 9's or better. One day the rest of the computing world will catch up...
Just checking uptime on our last bit of legacy kit... yup, 2000+ days. A lightning strike meant we had to take it down for 15 mins to replace a power unit... no no, don't get me wrong, it was still running, but the sparky didn't like working on live kit, otherwise it would be about 3000+.
Like AT&T in 1990, when a cascade failure took out long-distance telephony in the US, maybe?
But those types of failure in telephony occurred maybe once a decade, and generally triggered reviews and remedial work to make sure that the same problem never happened again. Cloud failures seem to be much more frequent than that, and don't appear to have the rigorous response.
Maybe all cloud providers should learn to walk before they attempt to run!
I believe to "do cloud right" you need to be able to exit from your vendor and you need to have a DR capability hosted by another vendor. It's not easy and I haven't seen it done (or managed to convince any customer it's worth doing), but I'll keep expressing the opinion.
Obviously the cloud vendors make it easy to get data in and hard (or expensive) to get it out, but imagine if a replica environment had been available, ticking over at minimum size on, say, Azure, with the ability to scale up on demand?
"I believe to "do cloud right" you need to be able to exit from your vender and you need to have a DR capability hosted by another vender. It's not easy and I haven't seen it done (or managed to convince any customer it's worth doing) but I'll keep expressing the opinion."
It's relatively easy these days: http://www.theregister.co.uk/2015/07/10/microsoft_tries_to_paint_vmware_azure_with_disaster_recover_detour/
"That is what you actually pay for when buying cloud. You pay for the assumption that you have paid for the mainframe of the 21st century - an infrastructure that does not fail"
Not in public cloud you don't. Availability is stated at 99.9% if you are lucky.
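(For scale - this is just arithmetic, not anything quoted from an SLA document - here is what those availability figures allow in downtime per year:)

    # Downtime budget per year implied by an availability figure (pure arithmetic).
    HOURS_PER_YEAR = 365.25 * 24
    for availability in (0.999, 0.9999, 0.99999):
        downtime_h = (1 - availability) * HOURS_PER_YEAR
        print(f"{availability:.3%} availability -> {downtime_h * 60:6.1f} minutes of downtime per year")

99.9% works out at roughly eight and three-quarter hours a year; five nines is around five minutes.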
You should design your applications to be resilient. See https://msdn.microsoft.com/en-us/library/azure/dn251004.aspx for some guidelines.
>Not in public cloud you don't. Availability is stated at 99.9% if you are lucky.
And MS has been unlucky since Azure's inception, because they have yet to reach that availability ... with servers that have a max uptime of one month, a monthly patch day requiring multiple reboots, and occasionally even bricking servers of a given type, I must say they are not doing too badly - still, nowhere near 99.9%.
1. Not until they manage to hire a drone who can do certificate management properly, which I doubt.
2. Not until they move their Windows Update test teams from China to the US or any other high-tech country like the UK, France, Germany... for example. Don't get me wrong, I am sure Chinese techies are good as well, but look at what's happened to Windows Updates since the teams moved to China (around the release of Windows 8): we have had several servers with update issues, found before updating live servers (luckily), but still. Windows workstations have had update issues several times this year alone.
From AWS Cloud Best Practices:
"Be a pessimist when designing architectures in the cloud; assume things will fail. In other words, always design, implement and deploy for automated recovery from failure. In particular, assume that your hardware will fail. Assume that outages will occur. "
Customers aren't paying for an infrastructure that does not fail - they are paying for things like elasticity, parallelism, and the transfer of capex to opex.
This thing where people come in to blame Netflix et al for not adding resilience is getting a bit old.
DynamoDB is a managed DB service with resilience built in automatically. There is no multi-AZ option for clients because it's built in.
Should people not use it now it's been demonstrated unreliable? Quite possibly. Should they avoid any Amazon managed service? Perhaps, though difficult.
Is it their fault and not Amazon's when DynamoDB fails? No.
Not knowing much about AWS is fine. Assuming that their clients are at fault when one of their managed services goes down isn't. Unless your point is that they shouldn't trust AWS at all in which case you'll likely find a solid approval base here.
The thing is Netflix do actually do this right. They do run active-active across regions and famously have Chaos Gorilla. So I'm still curious as to what went wrong for them (if indeed anything did go badly wrong and this isn't just media hype).
The rush to blame "cloud" from commentards who have no idea what they're talking about is a sad reflection of Reg readers.
"Unavailable servers continued to retry requests for membership data, maintaining high load on the metadata service."
I'd call this the root problem. No exponential backoff? AWS client APIs support exponential backoff with jitter. In other words, in case of failure a retry does not just wait x seconds then retry... it may start retrying in 1 second, then 2 seconds, then 4 seconds, doubling the delay each time. The "jitter" part means there'll be a bit of random variation in the time delays, so if the failed queries were all fired off at once, the retries won't be.
It sounds like calls from the storage system to DynamoDB were using fixed retry intervals instead of exponential backoff, or possibly just not enough backoff. With fixed backoff, once some load limit was hit where enough calls failed *even temporarily*, the retries would be mixed in with new calls (which, when they also failed, would be retried in turn), and the load would just keep getting worse and worse as more and more calls were retried. From their description of not even being able to reach the admin interface, this sounds likely. With exponential backoff with jitter, the load would increase at first as these calls are retried at short intervals, then level off and hopefully decrease as failed calls are retried less and less frequently. And if they were lucky and it was just a load spike, then (perhaps even just a few minutes later) the load could have dropped enough for new calls to succeed and the failed calls to also succeed on retry.
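In pseudo-ish Python, the technique looks something like this - a sketch of the general idea, not the actual AWS SDK code; call_service, the delays and the attempt count are all placeholders:

    import random
    import time

    BASE_DELAY = 0.1     # seconds - made-up number, not an AWS default
    MAX_DELAY = 20.0     # cap so the wait doesn't grow without bound
    MAX_ATTEMPTS = 8

    def call_with_backoff(call_service):
        """Retry a failing call with capped exponential backoff and full jitter."""
        for attempt in range(MAX_ATTEMPTS):
            try:
                return call_service()
            except Exception:
                if attempt == MAX_ATTEMPTS - 1:
                    raise                                    # out of attempts - give up
                # Exponential: the ceiling doubles each attempt (0.1s, 0.2s, 0.4s, ...), capped.
                ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
                # Full jitter: sleep a *random* amount up to the ceiling, so a crowd of
                # failed callers doesn't come back in lockstep and re-create the spike.
                time.sleep(random.uniform(0, ceiling))

    # Usage: result = call_with_backoff(some_flaky_remote_call)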
Ah yes, but that didn't stop old skool Ethernet going into a traffic storm and catastrophic failure once saturation was reached, did it?
That technique stops working once you can pretty much guarantee that, whatever moment you pick to retry, someone else will hit it too. This causes a blizzard of retries followed by the metaphorical sound of a grand piano being dropped 20m onto concrete.
In heavily loaded traffic situations, a protocol with some form of arbitration[1] is required.
[1] Like maybe taking turns with an access token??
I could be wrong here but I have a hazy memory from my CompSci degree that for Ethernet CSMA/CD the whole point of the exponential backoff and retry was that your retry period was randomly chosen to be between 1 and 2 to the power of your retry attempt (based on what your slot/interval time was).
From Wikipedia:
After c collisions, a random number of slot times between 0 and 2^c - 1 is chosen
Otherwise you'd have clients that clashed at the same time continually clashing as they all try again at the same time.
Any time I've seen a "dumb" retry approach in a production system (hey, let's wait 3 seconds between retries and give up after 3 attempts), this always springs to mind, and I'm frightened by how many folk haven't a clue when I mention it.
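Here is the quoted rule as a few lines of Python, for anyone who hasn't met it - illustrative only, not a NIC driver; the 51.2 µs slot time and the 10/16 limits are the classic 10 Mbit/s Ethernet figures:

    import random

    SLOT_TIME_US = 51.2     # classic 10 Mbit/s Ethernet slot time (512 bit times)

    def backoff_delay_us(collisions):
        """Binary exponential backoff, per the rule quoted above."""
        if collisions > 16:
            raise RuntimeError("too many collisions - give up and report an error")
        k = min(collisions, 10)                   # the range stops growing after 10 collisions
        slots = random.randint(0, 2 ** k - 1)     # pick 0 .. 2^c - 1 slot times at random
        return slots * SLOT_TIME_US

    # Contrast with the "dumb" fixed retry: every station that collided waits exactly
    # the same 3 seconds and then collides with the same crowd all over again.
    for c in range(1, 6):
        print(f"after collision {c}: wait {backoff_delay_us(c):8.1f} microseconds")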
This outage really exposes the fallacy of the public cloud: everyone expects cloud capacity to scale infinitely and handle any demand. The reality is that it can't.
Network folks have known this for ages: shared Internet bandwidth is unreliable; if you want reliability, buy private lines from different carriers with route diversification.
Time for the compute and storage folks to learn that lesson, or stand up your own cloud with your own peaking capacity.
The problem is that the two kinds of democracy are incompatible.
The EU politicians are a group of inept consensus-builders, which means that anything that comes out of the EU is a bit like the big American beers: everything with substance has been removed to avoid offending anyone, which means the result is flat and pointless.
The Americans are owned by big business, which means that anything that comes out serves the interest of the dollar and not necessarily the interests of warm bodies anywhere. And when push comes to shove, the guvmint will do what it likes anyway.
Don't even get me started on polarisation as a political tactic...
I'm more and more convinced that benevolent dictatorship is my form of government of choice.
"No just (or even) DECNet - it's fundamental to CSMA-CD working properly, else everyone would keep trying to transmit at the same time.
With today's star network topology with switches and FDX links, rather than a shared bus, it's not relevant."
Uhh, yeah it is. Not usually at the network level (except wifi), but at the application level this can also be important. In this case, if the timed-out requests were retried at some regular interval, then they could just keep causing load spikes and timing out at regular intervals (and if the load spike lasts longer than the retry interval, you're really done).
At the moment, cloud infrastructure seems to be owned and managed by a small number of entities - AWS being one of them.
Do you not think there will come a time when the owners/investors in these cloud companies will think that certain aspects of their hardware, software, service provision, property ownership portfolio become undesirable in some way, and try to divest it/them? (Maybe the government wanting to split a company to combat anti-competitive practices would be a good example).
All very well saying that your cloud service is provided by Amazon, but will you be told if it does become sub-contracted? This has repercussions not only for the ability of a company in the ownership chain to enforce SLAs, etc., but in other areas outside the scope of this current topic, e.g. privacy. No doubt Amazon would do due diligence on their sub-contractors, but when the sub-contractors start to sub-contract, that's where the messy business would no doubt begin.
Look through history for examples - particularly those created as a result of government intervention.
> (1 Website + 1 Database) + 1 Server = Nein Problem
Good luck with that next time you get a power outage, segfault, updates demanding a reboot, infrastructure - eg, internet pipe - outage, etc ad nauseam.
One datacentre is safer but still not safe.
I see a lot of pissing and moaning here but the problem with Cloud is data protection, not uptime. Pretty much every alternative has worse uptime.
Oh, quite, which is why I personally do not recommend cloud computing except in cases - for example, my current employer has requested a feasibility study on keeping energy usage and visualisation data on AWS or Azure - where the data is already safely anonymized and cannot be identified without data kept on-premises (or rather, in our data centre).
But this article is about uptime and the loss thereof, not data security. And all of the whining on this large thread is also about uptime and at least half of it is clearly from people with no fucking clue whatsoever which goes to prove that the Reg is still popular with 1st Line Support drones, I suppose.
If you strongly encrypted all the data stored on cloud drives would that solve the data security problem?
Assuming you could be certain you did not lose the encryption keys.....
Am I right in thinking only the US-East region was affected by these issues?
Very interested in folks comments....
John
>If you strongly encrypted all the data stored on cloud drives would that solve the data security problem?
If it were feasible, partially. It wouldn't save you from the Data Protection Act though.
> Assuming you could be certain you did not lose the encryption keys.....
That is also an issue. And in order to stay safe, you'd have to run the decryption at your local end which is a computational expense you might want to avoid.
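To put a concrete (and deliberately over-simplified) shape on "run the decryption at your local end", here is a sketch using Fernet from the Python cryptography package - purely an illustration of the pattern, not a recommendation, and the record contents are made up:

    from cryptography.fernet import Fernet

    # Client-side encryption sketch: data is encrypted before it leaves your premises
    # and decrypted only locally, so the cloud store only ever sees ciphertext.
    # Key management - the bit that actually hurts - is hand-waved here.

    key = Fernet.generate_key()          # keep this on-premises; lose it and the data is gone
    cipher = Fernet(key)

    record = b"name=Jo Bloggs;meter=12345;kwh=42"
    ciphertext = cipher.encrypt(record)  # this is what you would upload to the cloud

    # ... later, back on your own kit ...
    assert cipher.decrypt(ciphertext) == record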
My idea was to use FDE. The AWS VMs would be provided with a decryption key at boot time so they can access data stored on the disk. The key would then be deleted.
I was thinking of using EncFS. Why would this cause problems with the Data Protection Act?
Please note I no longer live in the UK....
TIA
John
The DPA makes no allowance for encrypted data. If you're storing names, addresses or personally identifiable information, you ARE subject to the DPA and cannot move that data out of a "known" zone. So a datacentre is fine but "the Cloud" - which means your data sits somewhere within a whole region and not in a place you can give an address for - is a breach and can be prosecuted. And will be. The DPA is enforced pretty hard.
"Well what's more important? Security of data or your customers not being able to buy a pair of tights at 3am (which may be mid-afternoon their time)?"
There. Fixed that for you.
If you are using the cloud in a large-capacity way, you are probably providing some sort of service to many people. Movies, music, sales, blah blah. Now you could run this from your server room with a big fat pipe to handle the bandwidth; but one nasty lightning strike or flood or misplaced backhoe or trigger-happy ex-employee could take you offline. Perhaps for good.
I'm no fan of the cloud for personal computing, I prefer my photos to remain on my device; but when it comes to a large business, there may be something to be said for running the services off-site spread across computer units potentially at different sites with the ability to scale up and down according to demand. You can't expect it to be 100% available, you couldn't say the same for your own server room. That said, it is perhaps by way of embarrassing and public cock-ups like this that the system finds itself flailing in the midst of a real life stress test, and once the issues have been identified and solutions devised, may be that much more reliable in the future.
I should remind you that huge swathes of mainland Europe were plunged into darkness in November 2006 after some Germans switched off a power line to allow a ship to pass (nothing unusual there), which somehow triggered a massive cascade failure - https://en.wikipedia.org/wiki/2006_European_blackout.
Even with well designed systems, things are going to be out of spec and things are going to go wrong. We hope the system is well enough designed to cope, but like a computer crash when the kernel stack is trashed, sometimes the only possible result is a failure. Let's hope for the next time, the situation was identified and is better handled so that it won't bring the whole thing to its knees...
AWS is quite transparent about the architecture of DynamoDB, and it's one of the key things that gives you confidence about its claims about endless scalability - if you play by its rules. But the explanation of the failure and the steps they're taking to avoid a repeat coloured in some operational details which users of the service like me would find fascinating.
Some other thoughts...
Firstly, you would hope that AWS had enough confidence in this service that they'd use it heavily internally, and we now know this is indeed the case. This is important because unlike with much software, where it's only your external customers who are there to really find the problems, the DynamoDB team have their colleagues on the other product teams to keep happy. There will be a lot of pressure on the DynamoDB team to ensure that such an important architectural cornerstone becomes even more resilient.
Secondly, the cause of the failure appears to be that certain metrics which should have been exposed, weren't, and the failure was the eventual consequence of such a metric moving outside acceptable bounds. Beyond fixing this, once the dust settles, I expect there to be efforts to identify other potential missing metrics, especially those which ought to have been added in the light of recent changes to the service. (DynamoDB has over the years improved in many useful ways - the GSIs mentioned are just one great addition).
There isn't any IT department out there that can guarantee their company is immune to an outage. The best you can hope for is that when disaster strikes, the team resolves the issue and plans to avoid a repeat, and this is what we've seen with the way AWS has handled this outage, which gives me more confidence, not less, in a great service from them.
Let the down-votes commence!
"My idea was to use FDE. The AWS VMs would be provided with a decryption key at boot time so thay can access data stored on the disc. The key would then be deleted
I was thinking of using EncFS. Why would this cause problems with Data Protection Act?"
1) Yes it would, per some other commenters: it's about keeping control of the data, not control of *unencrypted* data.
2) Yes on a second front. How many have a data breach because a powered-off server or disk is physically carried off? Very few (I recall reading about someone or other who had their server seized, and the feds could do nothing with it because it was encrypted, so it does happen; it also happens with portable computers, CDs, USB sticks, and tapes). How many have a data breach because their system was hacked, asked to send all that juicy data out, and obligingly complied? Quite a few. Full disk encryption would do nothing against this attack.