'Too big to fail' cloud giants like AWS threaten civilization as we know it

The world’s largest cloud players are in danger of becoming too big to fail, the boss of Canalys warned today. This is a problem because the list of potential trip hazards for the likes of Azure, AWS and Google is growing ever longer – and at least one of them is going to stumble and fall, taking customers' off-premises …

  1. Nate Amsden

    biggest DDoS threat

    is collateral damage. Not being directly targeted but being on a shared service that has someone else that is targeted.

    Lots of random people and orgs use public clouds, which just means they have bigger targets painted on their backs.

    I recall in the earlier days of amazon cloud credit card fraud was one of their biggest issues, not sure how it stacks up now.

  2. Anonymous Coward

    Historical reference

    see Mainframe

  3. Pirate Dave Silver badge
    Pirate

    Cool

    So once the huff-n-puff of "Cloud" wears off, and once a few companies experience major disasters by putting all their eggs in one off-site, hosted, cloud-lined basket, there will be rich pickings to be had for those of us who remember how to be actual Network Administrators. Since all the yoof will be busy picking their noses and wondering why they can't write a Javascript app that physically installs a 48-port switch into a rack...

  4. Erik4872

    Absolutely

    I am totally not a luddite even though I've spent a lot of my career focused on data centers and hardware. I know it's a good thing when you can take complexity out of the equation and build a service that functions well enough that you don't have to worry about the internals.

    BUT...I do worry a lot about new entrants into IT not getting enough education around the fundamentals. If you're a web developer these days, you aren't really interacting with the servers at a low level -- you're submitting API calls and getting the results returned to you. But, do you know _how_ the API delivered those results? Same thing goes with the core networking protocols like TCP/UDP and other fundamental services like DNS/DHCP. If those things are abstracted so far that only service providers and a few wizards know how they work, what does that do to basic troubleshooting skills? Will IT pros just throw up their hands and call the service provider to solve problems they may have been able to track down themselves previously?

    I think the article is spot on -- if talent gets too scarce, a company is going to outsource to a company that has those people -- and the salary behind that skill set will drop as more offshore people are trained. Look at mainframes -- the big outsourcing firms have tens of thousands of Indians fully trained on mainframes, while hipster startup guys thumb their noses at it and go back to writing their apps in RESTful JavaScript frameworks. This is the precise reason for the offshore firms offshoring -- they can't get domestic people at a reasonable pay rate who want to learn something that isn't cutting edge, because those same people are worried they'll be thrown out at a moment's notice and not be immediately picked up because they chose to work with an older technology.

    It's very important to allow things to get easier, after all, who wants to program in assembler when a high level language is so much easier to understand? But in the rush to containerize and abstract every single complexity, I think we might be leaving too many things buried under the abstractions.

    1. Nate Amsden

      Re: Absolutely

      core networking protocols? tcp/ip ? Shit I had a developer come to me recently with a straight face, and said he had a new app he needed deployed and he needed some back end storage for it. Sure, so I asked him a few Qs, and he answered, all sounded reasonable. So I then asked him what his timeframe was, and he said "they want it today........." (he added the dots..) Fortunately this was over online chat and I burst into laughter. I asked if the code was even written yet, he said the front end part was, the back end was not (part that stores the images and stuff). He asked if there was an API to code to, I said he should ask the development team, but I am very sure the answer is no, and the lead time to get a new API is probably a few weeks assuming they even agree to do it.

      I have worked with developers who don't even understand the basic concept of reading log files, it's depressing. I mean they wrote the damn thing can't you at least look at the logs ? They often just throw up their hands and say it doesn't work I don't know what to do. I have a new saying for many developers, saying may not be the right word but it goes something like "XX is broken, I don't know why, I don't want to understand why I just want someone else to fix it for me". Often times it is in the code. I can understand non technical people behaving this way but not "developers". Certainly not all developers are like this. BUT I see it as a disturbing trend that is only increasing as the layer of abstractions increase between the underlying hardware and the software they are writing.

      How many developers even know how to look at a core dump ? (e.g. php/apache core dump). I don't think I've even worked with more than 2 that can do that in the past 8 years. They know php, but go one inch outside their comfort zone and it's like deer in headlights.

      Another favorite is I give a developer a specific SQL query that is causing problems and ask them if they can fix it -- they have no idea where that query came from (other than their code) because it's so abstracted, all they know is they called some object and the magic sauce underneath generated the SQL.

      This kind of thing is really what gave rise to things like docker. Instead of fixing broken systems like node or ruby(another depressing thing they haven't fixed ruby on some core things since the first time I had to operate it 9 years ago, now I understand it is by design and they won't fix it, anymore than Linus will fix constantly breaking Linux driver ABI). So instead of fixing the actual problem they come up with something like docker, to hide the monstrosity of dependencies and crap in a container to make it more deployable and manageable. Oh my god the first time I tried to build node I think it took me 3 days and it ended up being a version that was too old for the developer. (yes I try to build things properly I won't download some random shell script and pipe it to bash as root to do everything magic). Worst inter dependency mess I have ever seen, even put ruby to shame. One recent node upgrade I recall we had to upgrade GCC, and other core libraries which were not compatible with our OS standard, just to compile the damn thing. I mean I haven't had to go out of my way to upgrade gcc for something I want to say in 15 YEARS (I remember using 'egcs' way back when).

      Then another development team comes around and says they need something else, some other version, but that version breaks the code from the first development team..it goes on and on what a mess.

      The worst part about it is more people are using it, and more NEW people are using it not knowing how unstable it really is because that's all they know. I see developers constantly having trouble with mismatched node shit.

      Getting developers to know low level protocols, shit I'd be happy if they just knew their own damn application stack.

      1. Erik4872

        Re: Absolutely

        All valid points. I'm a little upset that this is the state of things. The problem is that people who do actually understand systems are dismissed out of hand and our knowledge is dismissed pretty easily as dinosaur crap that no one really needs to know anymore.

        I think one of the problems is that systems guys have traditionally tended towards the BOFH personality. Often that's for good reason, keeping the developers from wrecking things for everyone else. But some really bite back hard when developers ask for things. When the public cloud showed up on the scene, it immediately opened up another channel for the developers to go through to run around the sysadmins. Seriously, I'm doing an Azure-centric project now; you wouldn't believe how easy they're making it for developers to just throw whatever they want out on the Internet...it's literally push button. Forget about all the abstracted systems stuff running under the hood...it's the cloud! It's elastic! It grows and shrinks with demand seamlessly! The developers on this project seriously look at me with straight faces and say this when I try to set up some sane limits on what can be deployed and how it should work. As much as I hate the term DevOps, our two sides are going to have to merge at some point to handle some aspects of our mutual jobs. Developers can't just deploy to a magic infinite cloud without expecting some healthy pushback, and the BOFHs among us are just going to have to let some control go.

        1. Anonymous Coward

          Re: Absolutely

          With the breadth and depth of IT nowadays I'm not surprised lots of people don't know about lots of things in it. It's hard enough to be an expert in one area of IT.

          I don't know how to run a power station, but I sure do rely on electricity. Compute is marching to the drums of history and most people that need compute are happy about it.

          Will there be problems, unforeseen circumstances and drastic changes. Sure. But welcome to the most exciting time to be in IT since the x86 servers took over the world. I for one am quite happy to live through chaotic times as a human.

      2. Jay 2

        Re: Absolutely

        Quite recently someone in the company decided that they wanted a dashboard type front end to cover an application over several servers. For some reason, they decided to outsource this to some interns out in India. Me being the lucky sysadmin got to set up Tomcat for them and then ended up getting dragged into every problem they ran into to help sort it out.

        At some point what they had written wasn't quite working. Did they look at the Tomcat logs? No, I had to point that out. Another problem was that they had decided to pull down some code (something fairly noddy) from the Internet when the application started, but that was being blocked by the corporate firewalls. Once again I had to point out that constantly relying on an external resource was bad practice, as what would happen if one day it wasn't there? Our security policy doesn't really like non-human accounts using SSH, as it makes it hard to audit. I gave up counting how many times I had to tell them that whilst using SSH would be easier it would also not be a good idea.

        1. Anonymous Coward

          Re: Absolutely

          Same but from expensive professionals in Manchester. So the Indian intern thing doesn't make much difference.

          1. Kurgan

            Re: Absolutely

            Your "expensive professionals" just outsourced the job to cheap Indians.

  5. Anonymous Coward

    I don't get it...

    Of course I do get it but in general I simply don't get how this would be different with, say, a data center or such. In one data center you'll also find hundreds if not thousands of companies who host their services there. When that goes down (whatever kind of outage) then you risk to run into the same kind of hazards.

    I do agree that the single point of failure is often bigger. Especially if cloud companies are hosting on single pieces of hardware and not actually in a virtual networked system which consists of multiple hardware 'components' (which is what most cloud providers seem to be doing, once the computer goes down then so go dozens of virtual instances).

    But too big? Nah, they'll have their asses covered with the usual legal mumbo jumbo.

    In my opinion it's not an issue of size but an issue of actually getting it right: only promising that which you actually deliver instead of pretending that your cloud solution is something which it's actually not.

    1. oldcoder

      Re: I don't get it...

      Even with distributed redundant cloud solutions (which COST more), you still get worldwide failures.

      Azure has gone down for days at a time - and not just in one data center - but ALL of them due to a propagating error.

      The same has happened to parts of the other cloud vendors.

    2. Doctor Syntax Silver badge

      Re: I don't get it...

      "But too big? Nah, they'll have their asses covered with the usual legal mumbo jumbo."

      That doesn't stop failure. It just stops legal consequences for them. The failure will still affect the businesses that use them which was the whole point of the article. If anything being insulated from legal consequences might make things worse - if you don't have to be careful for legal reasons maybe you won't be as careful as you might otherwise have been.

    3. Fan of Mr. Obvious

      Re: I don't get it...

      I would argue that Data Centers are, in fact, different. If for no other reason, they are different because you own and fully manage the hardware resources. Private cloud is nothing more than your own app/OS instance(s) running on some otherwise shared resource. All the XaaS stuff has its place, but you cannot escape that you are not in control. If you do not operate your backend, best case you have a managed service contract where they racked some dedicated hardware in a DC, but that is also a far cry from AWS/Azure.

      If you run your own DC or lease space, unless you are so small it just does not matter: You should have multiple peering points that YOU chose and are paying for. You are not relying on external cloud providers load balancing your traffic with everyone else on both ingress and egress. You are not relying on DDNS to find your servers and services. You are not worried that some evergreen deploy is going to wipe out functionality that you are relying on. And it goes on.

      The "Too big to fail" is in regards to the potential damage that will be caused, meaning the need for external propping up could be required else it would do substantial financial harm to others. The contract does not mean jack at that point.

    4. Anonymous Coward

      Re: I don't get it...

      Everything fails (at least there is a chance that anything will fail at some point). I think in my time I've experienced 3 total datacenter outages in 18 years: one flood, one power and one network cables. The second two I believe both involved diggers. Then there have been countless smaller failures at server, network or rack level. Then even more problems at the server OS/Application level and third party services.

      Business needs to accept and plan for the inevitable. Even if you spend vast sums of money on Disaster recovery and prevention, even more on HA and yet more on all the bits and pieces to make everything work fine. You know what. Something will fail one day and you'll go "damn, didn't think of that"

      Everything fails, live with it and let the business decide how it will cope when that failure comes.

      Course, you could be lucky.

      1. itzman

        Re: I don't get it...

        Everything fails. Absolutely, but large monolithic systems that have a monopoly on almost all web services will be far more damaging when they fail.

        Single point of failure for global IT.

        And that, friends, is the problem with globalisation.

        Removing the watertight doors from the ship of humanity to allow free movement is fine until you get holed beneath the water line. Then you really need compartmentalisation and to restrict the free movement of stuff that damages, be it people, malware or whatever.

        I am seriously and sincerely more worried about building a post industrial global society that can be crashed with one single well targeted attack than I am about many many other things that occupy the front pages of the media.

        In fact at heart that was the main reason to vote brexit. Systems theory shows some horrendous failure modes on huge systems of which no part can function without reference to some other part. Quasi-autonomous smaller systems are resilient: They can function - perhaps in a more limited way - if the overall structure collapses.

        Or if the overall structure is completely dysfunctional.

        What we are seeing today, politically economically and technically, is what happens when global or near-global political corporate and technical entities get subverted by narrow interests, or disabled by targeted attacks. The shock waves roll round the world.

        We have rushed to embrace the benefits of globalisation, and its IT analogue, large distributed (but interconnected) cloud services.

        Now we are about to learn the penalty.

  6. wyatt

    There is no perfect solution. The issue I can see is that there is a lack of planning and risk assessment. As long as you're aware of the capabilities of the option you take, you should have no problem when an issue occurs. Your business is either able to deal with it, or you stop working.

    If my office loses its internet then they lose the CRM: Back to paper. If they lose their E1s: phones are re-directed to the DR site. If there is a total telecomms outage they go home, most customers have our direct numbers and would get in touch some how. Would it work? Probably. Would it be painful? Yes.

  7. steve.bradley@ctsltd.net

    What does too big to fail mean?

    Too big to fail is a phrase we have come to associate with banks and if a bank finds itself in a situation where it is likely to fail the government simply throws (aka promises) lots of our cash at them to prop them up.

    With a cloud provider it isn't that straightforward because you can't just throw cash at the problem. If a cloud provider's offering keels over then it is unlikely to be a cash issue but a deeply ingrained technical matter and as noted a domino effect can cause widespread disruption.

    So whilst they may be too big to fail, they will fail and really people need to be prepared for that, just as they would/should do if they have a single server sat in the corner of their office running everything.

    1. Naselus

      Re: What does too big to fail mean?

      "With a cloud provider it isn't that straightforward because you can't just throw cash at the problem. If a cloud provider's offering keels over then it is unlikely to be a cash issue but a deeply ingrained technical matter and as noted a domino effect can cause widespread disruption."

      I'm not sure that's true tbh.

      Sure, you can have major disasters knocking down chunks of the distributed network - a DC can be hit by a tidal wave, say - but these are short-term, temporary problems that traditional DR has long provided solutions to, and which global-scale cloud providers have the redundancy to deal with. The alternative - a major architectural screwup, say - is unlikely to be something confined to just cloud offerings. The cloud is basically just the same virtual infrastructure that 90% of us are almost certainly running in our on-site DC, only much bigger (and owned by someone else). So if we have a Millennium Bug-style problem appearing in ESXi, this is going to impact you regardless of if you're on-site or in someone's cloud. And the patching required to fix such issues is a lot easier for a Google or an Amazon to roll out to their own DCs than for everyone to individually discover, acquire and deploy. We're again looking at comparatively short downtimes.

      The thing that worries me about cloud is the financial side. Cloud vendors are insanely competitive at present; you can currently set your whole infrastructure up on Azure and not pay a dime for 3 years, while AWS (being Amazon) has a razor-thin cash pile on hand at any given time with most money getting quickly sunk into new projects and fixed capital. Even minor changes to the global financial system (say, a 2% interest rate rise at the Fed) could cause major financial disruption to these outfits and plunge them into loss-making territory very quickly. Combine that with the tech market's current bubble-like tendencies, and you're looking at a bunch of companies with huge fixed costs and tiny profit margins that could lose a lot of value extremely quickly. This is pretty much what happened to real estate companies in 2008, causing lots of them to bankrupt overnight and leaving large numbers of abandoned projects littering Western countries.

      While Microsoft have a cash mountain that can (and undoubtedly would) be flung into Azure at the first sign of trouble, Amazon doesn't have a similar capacity for AWS, and its traditionally tight profit margins and rapid growth model mean the rest of the business is unlikely to be capable of propping up the cloud business for long if it turned negative. AWS is at least currently very profitable for Amazon, though - its profit margin is around 20%, compared to about 3% for the rest of the company - so there's room for maneuver there.

      But it does seem to me that the real weak link in the chain is the financials; the intense competition between well-resourced giants who can afford to run at a loss for a decade or more, developing during a major economic contraction with extremely unusual fiscal and monetary policies, has produced an unhealthy environment that may not be capable of adjusting to more normal times or have the resources necessary to adjust to an economic shock.

      1. steve.bradley@ctsltd.net

        Re: What does too big to fail mean?

        I agree completely but for the banks you throw cash at the problem because cash is the answer. For the Cloudies not only do you need the cash which is the point you rightly made but you also need to buy the resource with that cash to sort the problem out and that could be a problem or if nothing else add to substantial delay which is long enough to bring about a fail. So to my mind, "too big to fail" has both a technical and a financial aspect which I didn't think was noted in the article itself.

  8. Anonymous Coward

    Lots of turkey's complaining about christmas on this thread.

    You can keep believing your skills are irreplaceable and the business can't survive without you, right up until the point you are no longer needed and replace by someone doing a better job delivering IT as a service.

    1. Anonymous Coward

      Re: Lots of turkey's complaining about christmas on this thread.

      "Lots of turkey's complaining about christmas on this thread. " .. some of whom use apostrophes and capital letters in a vaguely appropriate manner ...

      :-)

      1. Anonymous Coward

        Re: Lots of turkey's complaining about christmas on this thread.

        Maybe they can get jobs as English teachers then?

        They'll have to find something to do. ;)

    2. Doctor Syntax Silver badge

      Re: Lots of turkey's complaining about christmas on this thread.

      "right up until the point you are no longer needed and replace (sic) by someone doing a better job delivering IT as a service."

      Which they do right up until the time they break something by which time there's nobody left in house to fix it.

      If you work in an in-house operation which is critical to the business you're aware of its impact. If it goes down it's the business that provides your pay and your colleagues' pay that's at risk. In that case getting it up and running becomes your one and only priority.

      If you work in any form of out-sourced operation that operation might be critical to lots of businesses. But your priorities for getting it up and running will be concentrated on your biggest/loudest/most litigious customers. The rest can wait.

      From the perspective of a business which has outsourced but isn't in the biggest/loudest/most litigious group they've gone from "one and only" to "the rest". They won't, of course, discover this until they're too late.

  9. Velv
    Boffin

    Business Continuity

    You place sufficient service in geographically distinct data centres so you can continue business in the event of something happening that seriously affects one.

    Why should clouds be any different? You use two different cloud providers to provision business continuity.

    1. Doctor Syntax Silver badge

      Re: Business Continuity

      "You use two different cloud providers to provision business continuity."

      Both of which are at the mercy of a man with a back-hoe at the end of your road.

    2. SImon Hobson Silver badge

      Re: Business Continuity

      > You use two different cloud providers to provision business continuity.

      Does that also mean paying double the dev costs for setting it up ?

      Notice that there isn't a nice agreed-upon common way of doing stuff across providers ? Good reason for that, all of them are in the business of making it hard for you to not put all your eggs in their basket. Also, there are price barriers put up in terms of traffic going out of the cloud vs internal traffic.

      So while I agree with you completely, I strongly suspect that if you are using "cloud" rather than just "online hosting" of your own servers, there's a cost (and complexity) penalty in terms of supporting two different APIs, and regardless of how it's done, cost penalties for traffic in keeping the two systems synced.

      Hence, a lot of outfits will rely on the distribution facilities in one cloud provider for their redundancy - and we know how well that works. C.f. when not long ago some of us weren't able to access our "hosted in the EU" Office 365 mail because there was a US based authentication server with a problem !

  10. The Godfather
    Facepalm

    Is this a gripe?

    Is this a plaintive cry for the old days and a gripe that change is altering the sphere and influence of 'traditional' channel players? Nothing stands still for ever and Manufacturers, Distributors and Resellers surely have a duty to change and evolve or die.
