Microsoft Azure goes TITSUP (Total Inability To Support Usual Performance)

Microsoft is struggling to sort out an Azure cloud outage that has today left users around the world unable to access various services. According to a message posted to the Azure service status page, the outage spans "Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile …

  1. Anonymous Coward
    Anonymous Coward

    Maybe they rolled out their latest patch set to it......

    1. hplasm
      Devil

      Well-

      That's what happens when you run a cloud on millions of Surfaces...

    2. Anonymous Coward
      Anonymous Coward

      Well at least Microsoft had one good piece of news today. Munich council users are up in arms about how crappy things are on Linux and Open Office and they are scoping an upgrade to Windows!

      1. Destroy All Monsters Silver badge

        they are scoping an upgrade to Windows!

        Government fail ain't news.

      2. Anonymous Coward
        Anonymous Coward

        And Microsoft suggesting they might move their German HQ to Munich has nothing to do with it of course.. obviously not because MS have absolutely no record of using their cash and their weight to influence decisions, and politicians are never ever corrupt... not that I'm saying that MS are funding the second mayor's political campaign through third parties but........

      3. Stretch

        downvoted for stating fact. nice.

        1. Marcus Aurelius

          Some people interpret upvotes as cheers and downvotes as boos.

  2. Alister
    Facepalm

    I received this ten days ago:

    As part of our ongoing commitment to performance, reliability, and security, we sometimes perform maintenance operations in our Microsoft Azure regions and datacenters.

    We want to notify you of an upcoming maintenance operation. We will be performing maintenance on our networking hardware. We are scheduling the update to occur during nonbusiness hours as much as possible, in each maintenance region. Single and multi-instance Virtual Machines and Cloud Services deployments will reboot once during this maintenance operation. Each instance reboot should last 30 to 45 minutes.

    The following are the planned start times, provided in both Universal Time Coordinated (UTC) and United States Pacific Daylight Time (PDT). The maintenance will be split into two windows and will impact Virtual Machines or Cloud Services in either half of the maintenance. We expect each half of the maintenance to finish within 12 hours of the start time.

    The maintenance period was from the 15th to the 17th August, so it looks as though they managed to stuff it up somehow...

    1. Anonymous Coward
      Anonymous Coward

      Each instance reboot should last 30 to 45 minutes.

      Wow, that's really bad. Even when planned.

      1. Anonymous Coward
        Anonymous Coward

        It seems to take that long for Windows to boot up anyway.

  3. Someone Else Silver badge
    FAIL

    Oh Dear!

    Seems they can't fix anything these days. Or keep stuff that does work running (I'm looking at you Skype PMs!)

  4. Fred Flintstone Gold badge

    Ah, the stink of True Innovation™

    Microsoft presents the blue CLOUD of death.

    Isn't progress wonderful? I bet they'll patent this.

    1. John Bailey

      Re: Ah, the stink of True Innovation™

      Nah.. It's the blue SKY of death..

      Not a cloud to be seen.

  5. Richard Conto

    512K Day?

    Could this be some aspect of 512K day? Systems that haven't synced or are partially synced due to erratic routing?

    1. Nate Amsden

      Re: 512K Day?

      No way, Microsoft runs SDN. And SDN will save the world, it doesn't have any limitations, you could put a billion routes into it and it won't skip a beat because it's web scale.

    2. hplasm
      Windows

      Re: 512K Day?

      640k is enough day for anyone!

  6. Proud Father
    FAIL

    Put everything in the cloud...

    ...what could possibly go wrong.

    1. MyffyW Silver badge
      FAIL

      Re: Put everything in the cloud...

      You might have hoped the cloud industry would have moved beyond the single point of failure for multiple datacenters. Ho hum, won't be moving production workloads there if I can help it.

  7. Anonymous Coward
    Thumb Up

    Total Inability To Support Usual Performance (TITSUP)

    LOL Love it.

    May I have permission to officially use this acronym when describing issues to our company's customers?

    1. hplasm
      Thumb Up

      Re: Total Inability To Support Usual Performance (TITSUP)

      Absolutely! Best Heading and Acronym Evar!

      1. Anonymous Coward
        Anonymous Coward

        Re: Total Inability To Support Usual Performance (TITSUP)

        Where I used to work several years ago the acronym MIE was commonly used when something went wrong. This indicated that a Mammary Inversion Event had occurred and that we are working hard to implement another inversion to restore the tits to their accustomed downward position.

    2. diodesign (Written by Reg staff) Silver badge

      Re: Total Inability To Support Usual Performance (TITSUP)

      "May I have permission to officially use this acronym when describing issues to our company's customers?"

      Go for it: IT giants ask why we use the word 'titsup' in headlines to describe services suffering outages, some even going as far as to suggest we should stop using the word. Today we spell it out.

      C.

      1. Quinnicus

        Re: Total Inability To Support Usual Performance (TITSUP)

        Here's another one:

        Failed

        Under

        Continuous

        Testing.

    3. Ken Moorhouse Silver badge

      Re: Total Inability To Support Usual Performance (TITSUP)

      Yeah, milk it as much as you want.

      There is a problem with this acronym though, and that is the implication that if one tit goes up, there is another one to take over and maintain "normal" service.

      Is there really this kind of redundancy in place in Azure's case?

      1. Pascal Monett Silver badge

        Of course there is redundancy. And the redundant one goes TITSUP too, to maintain service consistency.

      2. Phil_Evans

        Re: Total Inability To Support Usual Performance (TITSUP)

        Not entirely.

        Azure has lots of redundancy built in. Take the 'always make an opportunity out of a crisis' team who wrote the update. Instead of saying things in pragmatic terms, like 'network services', we get a parade of the various service feature (branded) names. But of course that's another rather convenient way of partitioning the issue (much like the contracts that hang off it) into financially manageable chunks when it comes to SLA true-up. A bit like saying after the event "despite the wind blowing the shed down, the foundations were functional at all times".

        Frackers.

      3. Anonymous Coward
        Anonymous Coward

        Re: Total Inability To Support Usual Performance (TITSUP)

        "Is there really this kind of redundancy in place in Azure's case?"

        Of course. Microsoft has any number of tits.

    4. Michael H.F. Wilkinson Silver badge
      Coffee/keyboard

      Re: Total Inability To Support Usual Performance (TITSUP)

      One tea-soaked keyboard at reading that acronym

      Will include this as mandatory vocabulary in my upcoming "Introduction to Computing Science" course (along with classics like BOFH, PFY, and PEBCAK)

  8. SVV

    Another day.....

    Another failure.......

    " the outage spans "Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site …". I think describing this list as "various features" is somewhat kind to them, somewhat along the lines of a car lacking various features such as an engine, wheels, doors, seats.....

    Anyway, I would expect a cloud service not to require any downtime at all - if it had been designed properly they could have moved the VM image to another server in seconds before shutting down the machine to fit a new network card: which also should not take anything like 45 minutes. Has anyone seen a service level agreement for this rubbish? Presumably you have one when relying on third parties for essential IT services?

    1. Hans 1
      Windows

      Re: Another day.....

      Yeah, but as you move the VM, something MS has had trouble with since day 1, you have to reboot the VM at least three times ... the third time you choose "Last Known Good" and a penguin appears on the top right.

  9. Hargrove

    Count on it.

    This was a comment on an earlier post "Whoops, my cloud's just gone tits up," Applicable here as well. And these are with relatively new data centres. The fun is just beginning

    ==============================

    Despite service providers pushing the reliability of their services, outages are a very likely reality for those using cloud services.

    First, there is something called the law of large numbers. Massively parallel systems at state of the art computing centres run to hundreds of thousands to millions of microprocessor cores. Even more astronomical numbers are being discussed for data centers where the goal is capacity to do lots of jobs as opposed to raw throughput.

    The presumption of solid state reliability can be seriously questioned.

    The state of the art has changed dramatically since the term “solid state reliability” became common. Transistor feature sizes and component densities have all changed radically. New materials have introduced new failure mechanisms. These have been well-understood for years:

    ITRS http://www.itrs.net/Links/2005itrs/Linked%20Files/2005Files/PIDS/4377atr.pdf

    Critical Reliability Challenges for The International Technology Roadmap for Semiconductors (ITRS)

    Since then, restrictions on hazardous substances have added a new failure mechanism. Among the unintended consequences of this initiative is the spontaneous formation of tin “whiskers” that eventually short to some other part of the circuit, causing failures.

    Bottom line: state-of-the-art microprocessors run 24 x 7 are going to have a limited life. Credible speculation is that this could be as short as a few years. And nobody appears to be seriously thinking about the cost of end-of-life replacement.

    The issue is not the probability that there will be a catastrophic meltdown of data centers. The problem is manageable with existing technology if cost to the customer is no object.

    The critical issue is that a small handful of large companies are effectively moving to limit the average customer's options to reliance on large IT services companies for all their information management needs.

    And then, there's bandwidth . . . a subject for another post.

    1. Anonymous Coward
      Anonymous Coward

      Re: Count on it.

      "Despite service providers pushing the reliability of their services, outages are a very likely reality for those using cloud services.

      First, there is something called the law of large numbers. Massively parallel systems at state of the art computing centres run to hundreds of thousands to millions of microprocessor cores."

      I think you're confused. This is a reason why there should be less downtime, not more. Redundancy should allow the system to keep going despite hardware failures.
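
The redundancy argument above can be put in numbers. This is a minimal sketch, assuming replicas fail independently of one another; the function name and the 99.9% per-replica figure are illustrative and not taken from any Azure documentation.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one replica is up.

    Assumes replica failures are statistically independent (an assumption,
    not something the thread establishes).
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): {combined_availability(0.999, n):.9f}")
```

The independence assumption is the weak point: a shared fault that hits every replica at once gains nothing from extra copies.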

      1. P. Lee

        Re: Count on it.

        Redundancy is the opposite of efficiency in normal operations.

        It is the enemy of profitability and cheapness. Unless everyone pays for the same redundancy, you won't get what you want unless you do it yourself.

        The law of large numbers of customers states that no customer is very important and even small cost-cutting procedures can result in large additional profits. If you want good service you need to be important. This is not like the car industry where a product defect is covered by a manufacturer's warranty and they will have to pay real money to fix it. Neither is it like the car industry where one manufacturer's product can be switched for another's with a quick call to a rental agency.

        The upshot is: you must calculate the value of your data and not rely on third parties to get things right.

        1. Rberns

          Re: Count on it.

          "The upshot is: you must calculate the value of your data and not rely on third parties to get things right."

          So one needs to implement an on premise data centre to backup the cloud?

          1. Anonymous Coward
            Anonymous Coward

            @Rberns

            "So one needs to implement an on premise data centre to backup the cloud?"

            Thank you. I believe you have said the final word on the whole matter.

        2. Anonymous Coward
          Anonymous Coward

          Re: Count on it.

          "Redundancy is the opposite of efficiency in normal operations.

          It is the enemy of profitability and cheapness."

          Microsoft's problems aren't due to cost cutting and lack of redundancy. They're due to poor engineering. The real question is, how were some database changes able to cause an immediate service-wide outage of the entire Visual Studio service? Why aren't updates staged? And is there any redundancy at all?

    2. Wade Burchette

      Re: Count on it.

      You just gave reasons #1258 and #1259 as to why the cloud is a bad thing.

    3. Roo
      Windows

      Re: Count on it.

      "Bottom line: state-of-the-art microprocessors run 24 x 7 are going to have a limited life. Credible speculation is that this could be as short as a few years. And nobody appears to be seriously thinking about the cost of end-of-life replacement."

      Precisely the premise of the early BlueGene machines. They used tried & trusted embedded cores at larger feature size & lower clock (better FLOP/W *and* higher MTBF). Superficially it looks as though BlueGene/Q is following the same path. Someone might take ARM in a similar direction, it has already been done with MIPS64 (SiCortex).

  10. Lars
    Happy

    I hope

    The airline industry will keep clear of clouds, in the sky and on the ground.

    1. EugeneFraxby

      Re: I hope

      ahem

      http://googleblog.blogspot.co.uk/2010/10/into-cloud-virgin-america-goes-google.html

      http://googleenterprise.blogspot.co.uk/2013/03/japans-all-nippon-airways-now-en-route.html

      http://googleenterprise.blogspot.co.uk/2010/02/flying-into-cloud.html

  11. oldcoder

    Anybody know if the SLAs for Azure include chargebacks for loss of business?

    1. Atomic Duetto

      The SLAs are readily available; however, if memory serves, liability is limited to a percentage of the service cost only. Which if you're a 'free' education user with say 640k users, is zero, zip, nada. Thanks for coming, get the Gestetner out and sniff the ink.

    2. Anonymous Coward
      Anonymous Coward

      50% of service costs

      I'm aware of one company with azure services where failing to meet the SLA results in 50% reduction in charges for the month. Small consolation if your business cannot operate.

    3. Anonymous Coward
      Anonymous Coward

      @oldcoder

      "Anybody know if the SLAs for Azure include chargebacks for loss of business?"

      From what I know of M$ (more than I want to) I am guessing "not so much".

  12. harmjschoonhoven
    Meh

    Re: Anybody know if the SLAs for Azure include chargebacks for loss of business?

    Azure's SLA guarantees "at least 99.9% availability".

    That is a downtime of 365.25*24*0.001 = 8 hours 46 minutes per annum.
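
The arithmetic above generalises to other availability targets. A minimal sketch, using the same 365.25-day year as the comment; the helper names are made up for illustration, and only the 99.9% figure comes from the thread.

```python
HOURS_PER_YEAR = 365.25 * 24  # same year length the comment uses


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at a given availability."""
    return period_hours * (1.0 - availability)


def as_hours_minutes(hours: float) -> tuple[int, int]:
    """Convert fractional hours to (whole hours, minutes), rounded to the minute."""
    total_minutes = round(hours * 60)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    # 0.999 is the figure quoted in the thread; the others are for comparison.
    for availability in (0.999, 0.9995, 0.9999):
        h, m = as_hours_minutes(allowed_downtime_hours(availability))
        print(f"{availability:.2%} availability: {h} h {m} min per year")
```

For 99.9% this gives 8 hours 46 minutes per year, matching the figure above.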

    1. John Tserkezis

      Re: Anybody know if the SLAs for Azure include chargebacks for loss of business?

      "That is a downtime of 365.25*24*0.001 = 8 hours 46 minutes per annum."

      This reminds us of two important factors:

      1/ Nothing is infallible.

      2/ Everything is more fallible than the marketing garb might make you think it is.

    2. fajensen
      Coat

      Re: Anybody know if the SLAs for Azure include chargebacks for loss of business?

      The caveat is that "downtime" means whatever Microsoft's legal team need it to mean to make the numbers work! Just like the NSA is not spying on everyone, etc.

    3. Anonymous Coward
      Anonymous Coward

      Re: Anybody know if the SLAs for Azure include chargebacks for loss of business?

      'Azure's SLA guarantees "at least 99.9% availability"'.

      Or else... what? Guarantees are as cheap as breath. What does the customer get if the guarantee is not met?

  13. Mark 85

    I guess someone hit "UPDATE ALL" thinking it meant all the updates, not all the servers? Once upon a time, heads would roll.... literally.

    1. Julian Smart

      You mean people would literally be executed for it?

      1. Mark 85

        I wish.... <sigh> I guess I worded that a bit too strongly.

      2. Nigel 11

        The ancient empires didn't have computers, but you can bet that if they had had them, then heads would indeed have rolled. (And that was the merciful option).

        The Romans insisted that the architect stood underneath his bridge or dome as the scaffolding was removed. A better form of quality control is hard to imagine.

        1. Marcus Aurelius

          What have the Romans done for us?

          The problem with building (and system) failures is that they most often occur when the architect has secured a new job before the failure is discovered.

  14. vmcreator

    VMware Rules

    Ha, where are all the MCITP spotty VMware haters now?

    I did warn you all about MS code = rushed.

    You get what you pay for in this world, simple.

    1. Anonymous Coward
      Anonymous Coward

      Re: VMware Rules

      we really don't mind you spotty VMers

  15. Anonymous Coward
    Anonymous Coward

    Replication

    The ability to fuck up everything everywhere really quickly

  16. Pascal Monett Silver badge
    Trollface

    I have a revolutionary idea

    Instead of putting everyone's services into a Single Point of Remote Failure, it might be interesting to explore a new venue : Distributed Computing.

    Imagine that ? If each service center had its own infrastructure and hardware, it would be isolated from exterior failures. In addition, each center could be able to implement its own rules independently from others, according to its own business case, and could design and implement the best configuration for its needs instead of relying on standards that may or may not correspond best to what it wants.

    . . .

    What, am I a few years too early ?

    1. Destroy All Monsters Silver badge
      Trollface

      Re: I have a revolutionary idea

      It's never going to catch on. People would have to retain skilled personnel to manage these things and who wants to pay for that? Additionally, quite a few vendors deliberately make this impossible. I can't imagine who would want to administrate servers using the Windows client interface for example. And imagine a Microsoft Patch Tuesday on your own installation? The horror!

  17. Infernoz Bronze badge
    Holmes

    Well, Ballmer really did stuff up Microsoft after Gates became too stale and retired

    It looks like the replacement still has a lot of work to do undoing all the damage Ballmer did!

    Microsoft may be in the Stagnation phase of a corporate lifecycle, and if it is, it will probably have to do an IBM-like fat burn to survive, which it appears it may be doing.

  18. pauly

    TITSUP is new?

    It was often used where I used to work 10+ years ago

  19. chivo243 Silver badge

    Maybe they Jumped the Gun

    and moved their workloads to the cloud already?

    http://www.theregister.co.uk/2014/08/19/microsoft_says_azure_not_ready_for_its_own_bizcritical_apps_yet/

    Naw, couldn't be...

  20. launcap Silver badge
    FAIL

    Cloud:

    "All the downsides of mainframes without the benefits"

  21. AndyDoran
    Pint

    Brilliant re-engineering of TITSUP. Have a Great Boo's Up on me.

  22. Hans 1
    Windows

    office 360?

    >possible other Azure Services

    So, how about Office 360? Is that down?

    1. Anonymous Coward
      Headmaster

      Re: office 360?

      Evidently for you, it's been down 5 days in the last year…
