Maybe they rolled out their latest patch set to it......
Microsoft Azure goes TITSUP (Total Inability To Support Usual Performance)
Microsoft is struggling to sort out an Azure cloud outage that has today left users around the world unable to access various services. According to a message posted to the Azure service status page, the outage spans "Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile …
COMMENTS
-
-
-
Tuesday 19th August 2014 06:20 GMT Anonymous Coward
And Microsoft suggesting they might move their German HQ to Munich has nothing to do with it, of course... obviously not, because MS have absolutely no record of using their cash and their weight to influence decisions, and politicians are never, ever corrupt... not that I'm saying that MS are funding the second mayor's political campaign through third parties, but...
-
Monday 18th August 2014 20:29 GMT Alister
I received this ten days ago:
As part of our ongoing commitment to performance, reliability, and security, we sometimes perform maintenance operations in our Microsoft Azure regions and datacenters.
We want to notify you of an upcoming maintenance operation. We will be performing maintenance on our networking hardware. We are scheduling the update to occur during nonbusiness hours as much as possible, in each maintenance region. Single and multi-instance Virtual Machines and Cloud Services deployments will reboot once during this maintenance operation. Each instance reboot should last 30 to 45 minutes.
The following are the planned start times, provided in both Universal Time Coordinated (UTC) and United States Pacific Daylight Time (PDT). The maintenance will be split into two windows and will impact Virtual Machines or Cloud Services in either half of the maintenance. We expect each half of the maintenance to finish within 12 hours of the start time.
The maintenance period was from the 15th to the 17th of August, so it looks as though they managed to stuff it up somehow...
-
-
-
Monday 18th August 2014 23:58 GMT Anonymous Coward
Re: Total Inability To Support Usual Performance (TITSUP)
Where I used to work several years ago, the acronym MIE was commonly used when something went wrong. It indicated that a Mammary Inversion Event had occurred and that we were working hard to implement another inversion to restore the tits to their accustomed downward position.
-
-
Monday 18th August 2014 21:54 GMT diodesign
Re: Total Inability To Support Usual Performance (TITSUP)
"May I have permission to officially use this acronym when describing issues to our company's customers?"
Go for it: IT giants ask why we use the word 'titsup' in headlines to describe services suffering outages, some even going as far as to suggest we should stop using the word. Today we spell it out.
C.
-
Tuesday 19th August 2014 06:22 GMT Ken Moorhouse
Re: Total Inability To Support Usual Performance (TITSUP)
Yeah, milk it as much as you want.
There is a problem with this acronym though, and that is the implication that if one tit goes up, there is another one to take over and maintain "normal" service.
Is there really this kind of redundancy in place in Azure's case?
-
Tuesday 19th August 2014 12:07 GMT Phil_Evans
Re: Total Inability To Support Usual Performance (TITSUP)
Not entirely.
Azure has lots of redundancy built in. Take the 'always make an opportunity out of a crisis' team who wrote the update. Instead of saying things in pragmatic terms, like 'network services', we get a parade of the various branded service names. But of course that's another rather convenient way of partitioning the issue (much like the contracts that hang off it) into financially manageable chunks when it comes to SLA true-up. A bit like saying, after the event, 'despite the wind blowing the shed down, the foundations were functional at all times'.
Frackers.
-
-
Monday 18th August 2014 21:54 GMT SVV
Another day.....
Another failure.......
" the outage spans "Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site …". I think describing this list as "various features" is somewhat kind to them, somewhat along the lines of a car lacking various features such as an engine, wheels, doors, seats.....
Amyway, I would expect a cloud service not to require any downtime at all - if it had been designed properly they could have moved the VM image to anothe server in seconds before shutting down the machine to fit a new network card : which also sould not take anything like 45 minutes. Has anyone seen a service level agreement for this rubbish? Presumably you have one when relying on third parties for essential IT services?
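(For the curious: the "move the VM in seconds" idea is plain old live migration. Here's a minimal sketch using the libvirt Python bindings; the host and guest names are made up, and Azure runs Hyper-V rather than KVM/libvirt, so this illustrates the general technique, not anything Microsoft actually does.)

```python
# Minimal live-migration sketch using the libvirt Python bindings.
# Hypothetical hosts and guest name; for illustration only.
import libvirt

SRC_URI = "qemu:///system"               # local source hypervisor
DST_URI = "qemu+ssh://dst-host/system"   # destination hypervisor (hypothetical)

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

dom = src.lookupByName("my-guest")       # hypothetical VM name

# VIR_MIGRATE_LIVE keeps the guest running while its memory pages are
# copied across; only the final cut-over pauses it, typically for well
# under a second rather than 45 minutes.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

src.close()
dst.close()
```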
-
Monday 18th August 2014 21:55 GMT Hargrove
Count on it.
This was a comment on an earlier post, "Whoops, my cloud's just gone tits up". Applicable here as well. And this is with relatively new data centres. The fun is just beginning.
==============================
Despite service providers pushing the reliability of their services, outages are a very likely reality for anyone using the cloud.
First, there is something called the law of large numbers. Massively parallel systems at state-of-the-art computing centres run to hundreds of thousands or millions of microprocessor cores. Even more astronomical numbers are being discussed for data centres where the goal is the capacity to run lots of jobs, as opposed to raw throughput.
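A back-of-the-envelope illustration of what that law of large numbers does to a fleet (both figures below are invented for illustration, not published Azure numbers):

```python
# Expected hardware failures in a large fleet; figures are assumptions.
cores = 1_000_000            # cores in a hypothetical data centre
mtbf_hours = 1_000_000       # assumed MTBF per core, in hours

failures_per_hour = cores / mtbf_hours
print(f"Expected failures: {failures_per_hour:.1f} per hour, "
      f"{failures_per_hour * 24:.0f} per day")
# -> roughly 1 per hour, 24 per day. At this scale something is always
#    broken, so availability becomes a question of software masking
#    failures, not of hardware never failing.
```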
The presumption of solid state reliability can be seriously questioned.
The state of the art has changed dramatically since the term “solid state reliability” became common. Transistor feature sizes and component densities have changed radically, and new materials have introduced new failure mechanisms. These have been well understood for years:
Critical Reliability Challenges for the International Technology Roadmap for Semiconductors (ITRS): http://www.itrs.net/Links/2005itrs/Linked%20Files/2005Files/PIDS/4377atr.pdf
Since then, restrictions on hazardous substances have added a new failure mechanism. Among the unintended consequences of this initiative is the spontaneous formation of tin crystal “whiskers”, which eventually short to some other part of the circuit, causing failures.
Bottom line: state-of-the-art microprocessors run 24x7 are going to have a limited life. Credible speculation is that this could be as short as a few years. And nobody appears to be seriously thinking about the cost of end-of-life replacement.
The issue is not the probability that there will be a catastrophic meltdown of data centres. The problem is manageable with existing technology if cost to the customer is no object.
The critical issue is that a small handful of large companies are effectively moving to limit the average customer's options to reliance on large IT services companies for all their information management needs.
And then, there's bandwidth . . . a subject for another post.
-
Monday 18th August 2014 22:32 GMT Anonymous Coward
Re: Count on it.
"Despite service providers pushing the reliability of their services, outages are a very likely reality for those using cloud services.
First, there is something called the law of large numbers. Massively parallel systems at state of the art computing centres run to hundreds of thousands to millions of microprocessor cores."
I think you're confused. This is a reason why there should be less downtime, not more. Redundancy should allow the system to keep going despite hardware failures.
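Putting rough numbers on that (the single-instance availability figure below is an assumption, chosen for round numbers):

```python
# With independent failures, N redundant instances each available a
# fraction 'a' of the time give combined availability 1 - (1 - a)**N.
a = 0.99                      # assumed single-instance availability

for n in range(1, 4):
    combined = 1 - (1 - a) ** n
    print(f"{n} instance(s): {combined:.6f}")
# 1 -> 0.990000, 2 -> 0.999900, 3 -> 0.999999
# The catch: a correlated failure (say, the same bad patch rolled out
# everywhere at once) breaks the independence assumption and with it
# all of those nines.
```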
-
Tuesday 19th August 2014 01:22 GMT P. Lee
Re: Count on it.
Redundancy is the opposite of efficiency in normal operations.
It is the enemy of profitability and cheapness. Unless everyone pays for the same redundancy, you won't get what you want unless you do it yourself.
The law of large numbers of customers states that no customer is very important and even small cost-cutting procedures can result in large additional profits. If you want good service you need to be important. This is not like the car industry where a product defect is covered by a manufacturer's warranty and they will have to pay real money to fix it. Neither is it like the car industry where one manufacturer's product can be switched for another's with a quick call to a rental agency.
The upshot is: you must calculate the value of your data and not rely on third parties to get things right.
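As a toy version of that calculation (every figure below is an invented assumption):

```python
# Expected outage loss vs the cost of doing your own failover.
outage_hours_per_year = 8.77          # ~99.9% SLA, per the figure quoted below
revenue_per_hour      = 5_000.0       # assumed cost of an hour of downtime
diy_failover_cost     = 30_000.0      # assumed yearly cost of your own failover

expected_loss = outage_hours_per_year * revenue_per_hour
print(f"Expected outage loss: £{expected_loss:,.0f}/yr "
      f"vs DIY failover: £{diy_failover_cost:,.0f}/yr")
# If the expected loss exceeds the cost of doing it yourself,
# "calculate the value of your data" stops being rhetorical.
```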
-
Tuesday 19th August 2014 09:58 GMT Anonymous Coward
Re: Count on it.
"Redundency is the opposite of efficiency in normal operations.
It is the enemy of profitability and cheapness."
Microsoft's problems aren't due to cost cutting and lack of redundancy. They're due to poor engineering. The real question is, how were some database changes able to cause an immediate service-wide outage of the entire Visual Studio service? Why aren't updates staged? And is there any redundancy at all?
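For what it's worth, staging isn't hard to express. Here is a minimal sketch of a canary-style rollout loop; the ring sizes, bake time, and stubbed functions are invented for illustration, and this is obviously not Microsoft's deployment system:

```python
import time

RINGS = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per stage (assumed)
BAKE_SECONDS = 600                  # let telemetry settle between stages

def deploy(fraction: float) -> None:
    # Stub: in reality, push the schema/config change to this slice.
    print(f"deploying to {fraction:.0%} of the fleet")

def healthy() -> bool:
    # Stub: in reality, compare error rates and probe results to baseline.
    return True

for ring in RINGS:
    deploy(ring)
    time.sleep(BAKE_SECONDS)
    if not healthy():
        raise RuntimeError(f"halting rollout at {ring:.0%}; rolling back")
```

With that structure, a bad database change takes out 1% of customers for ten minutes instead of all of them for a day.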
-
-
Tuesday 2nd September 2014 23:43 GMT Roo
Re: Count on it.
"Bottom line: state-of-the-art microprocessors run 24 x 7 are going to have a limited life. Credible speculation is that this could be as short as a few years. And nobody appears to be seriously thinking about the cost of end-of-life replacement."
Precisely the premise of the early BlueGene machines: they used tried-and-trusted embedded cores at a larger feature size and lower clock (better FLOP/W *and* higher MTBF). Superficially, it looks as though BlueGene/Q is following the same path. Someone might take ARM in a similar direction; it has already been done with MIPS64 (SiCortex).
-
-
-
Tuesday 19th August 2014 01:41 GMT John Tserkezis
Re: Anybody know if the SLAs for Azure include chargebacks for loss of business?
"That is a downtime of 365.25*24*0.001 = 8 hours 46 minutes per annum."
This reminds us of two important factors:
1/ Nothing is infallible.
2/ Everything is more fallible than the marketing garb might make you think it is.
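For reference, the same arithmetic as the quote above, extended to the usual "nines" (plain Python; nothing vendor-specific about it):

```python
# Allowed downtime per year for common SLA uptime levels.
HOURS_PER_YEAR = 365.25 * 24          # 8766 hours

for nines in ["99%", "99.9%", "99.99%", "99.999%"]:
    downtime = HOURS_PER_YEAR * (1 - float(nines.strip('%')) / 100)
    h, m = int(downtime), int(round((downtime % 1) * 60))
    print(f"{nines:>8} uptime -> {h}h {m:02d}m downtime per year")
# 99.9% -> 8h 46m, matching the figure quoted above.
```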
-
-
-
-
Tuesday 19th August 2014 09:00 GMT Nigel 11
The ancient empires didn't have computers, but you can bet that if they had had them, then heads would indeed have rolled. (And that was the merciful option).
The Romans insisted that the architect stand underneath his bridge or dome as the scaffolding was removed. A better form of quality control is hard to imagine.
-
-
Tuesday 19th August 2014 07:57 GMT Pascal Monett
I have a revolutionary idea
Instead of putting everyone's services into a Single Point of Remote Failure, it might be interesting to explore a new avenue: Distributed Computing.
Imagine that? If each service center had its own infrastructure and hardware, it would be isolated from external failures. In addition, each center would be able to implement its own rules independently of the others, according to its own business case, and could design and implement the best configuration for its needs instead of relying on standards that may or may not correspond to what it wants.
. . .
What, am I a few years too early?
-
Wednesday 20th August 2014 06:02 GMT Destroy All Monsters
Re: I have a revolutionary idea
It's never going to catch on. People would have to retain skilled personnel to manage these things, and who wants to pay for that? Additionally, quite a few vendors deliberately make this impossible. I can't imagine who would want to administer servers using the Windows client interface, for example. And imagine a Microsoft Patch Tuesday on your own installation? The horror!
-
-
Tuesday 19th August 2014 08:36 GMT Infernoz
Well, Ballmer really did stuff up Microsoft after Gates became too stale and retired
It looks like the replacement still has a lot of work to do undoing all the damage Ballmer did!
Microsoft may be in the stagnation phase of a corporate lifecycle, and if it is, it will probably have to do an IBM-like fat burn to survive, which it appears it may be doing.