
Not all bad news then
"The outage even prevented the interactive edition of the Daily Mail - Daily Mail Plus- from appearing."
You make it sound like a bad thing.
“We apologise for the disruption. We have identified the cause and are working to restore the service as quickly as possible.” Attempting to log onto your cloud service and being faced with a message like that is guaranteed to strike fear into the heart of anybody that has trusted all or just part of their company's CRM, email …
Now a year is not exactly 365 days, but if it were then that would be 525600 minutes. At four nines that allows for an outage of 5256 minutes or 87.6 hours. SLAs calculated on an annual basis are worthless. The same service level would allow for an outage of only 7.44 hours before being triggered if calculated on a monthly basis, which is more reasonable.
All of the above is of course meaningless if there's no (or trivial) compensation in the event that the service level is breached, which is the case with most SaaS offerings.
One must not, however, confuse SaaS with cloud. It's quite possible to get a robust infrastructure in the cloud by using two or more infrastructure providers and installing your own business software. That's why SugarCRM is infinitely preferable to SalesForce: you are in control, be it in the cloud or on your own infrastructure.
"then that would be 525600 minutes"
I think the author meant 99.99 percent, i.e. 9,999 per 10,000. Agreed with the rest of your post, though.
I'd also like to call fellow commentards' attention to the fact that the cloud provider is not the only possible source of downtime for the customer. Screw-ups by the telcos will probably cause localized outages more often than the cloud provider does.
four nines that allows for an outage of 5256 minutes
No, 53 minutes, it's 99.99%. 5-nines is generally taken as no more than 5 minutes downtime per year, or more realistically 1 hour per 10 years, since few people install services for only a year.
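For anyone who wants to sanity-check the arithmetic, here is a quick back-of-the-envelope sketch (Python; the 365-day year and 30-day month are assumptions purely for illustration):

```python
# Rough arithmetic behind "the nines"; window lengths are illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200

def allowed_downtime(availability: float, window_minutes: int) -> float:
    """Minutes of outage permitted in the window before the SLA is breached."""
    return (1.0 - availability) * window_minutes

for label, avail in [("two nines (99%)", 0.99),
                     ("three nines (99.9%)", 0.999),
                     ("four nines (99.99%)", 0.9999),
                     ("five nines (99.999%)", 0.99999)]:
    yearly = allowed_downtime(avail, MINUTES_PER_YEAR)
    monthly = allowed_downtime(avail, MINUTES_PER_MONTH)
    print(f"{label:22s} {yearly:8.1f} min/year {monthly:8.1f} min/month")

# Four nines works out to roughly 52.6 minutes a year (about 4.3 a month);
# the 5,256-minute figure corresponds to two nines.
```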
A bigger problem is that such a simple calculation only works for a total outage. What if your network is struggling due to, say, a DDoS on the cloud provider, but some traffic is getting through? Or some of your apps are running but some aren't? What number of nines does that give you, and how do you write an SLA for it?
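One approach that gets used in practice (though it is only one answer, and the request counts below are invented) is to measure availability per request rather than per minute of wall-clock time, so a partial outage costs the provider in proportion to the traffic it actually breaks:

```python
# Sketch: availability as good requests / total requests, so a partial
# outage (some traffic getting through a DDoS, some apps down) counts
# proportionally rather than all-or-nothing. Figures are invented.
total_requests = 10_000_000     # requests over the SLA window
failed_requests = 1_800         # errors / timeouts during the incident

availability = 1 - failed_requests / total_requests
print(f"Measured availability: {availability:.5%}")   # 99.98200%

slo = 0.9999                    # "four nines", measured per request
print("SLA breached" if availability < slo else "Within SLA")
```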
That's like comparing theft to copyright infringement.
If your problem is power, then what you need is a backup diesel generator (or however many are required to cover your needs). Insert it into the grid, fill it up, put it on standby and you're done, apart from the regular maintenance and trial runs. Frankly, apart from the cost, this is a no-brainer operation (and yet, some still manage to fudge it up anyway).
That is peanuts in price and hassle compared to a cloud outage. Even if you do go for a backup cloud operator (and we're talking big budget operations right there), there will be a boatload of problems to deal with on the spot when (not if) it happens.
There are internal procedures to devise, which will need to be amended after the first live-fire event (because there's always some difficulty that was not taken into account).
There is (company) user training, because said procedures need to be understood and implemented in an urgent situation. There is proper warning and communications, because the switch cannot be made before everything is actually ready, and (company) users switching over manually on their own, willy-nilly, is going to create its own special brand of havoc.
There is monitoring that the switch has taken place and that operations are once again in a working state. What are the metrics? How do you measure them in a time of crisis? How do you ensure that all required functions have been taken into account?
Finally, there is recovering from the outage, and the decisions that need to be taken, chiefly: do we switch back again, or do we stay put and only switch when this cloud fails? After the first live-fire event, maybe previous policy decisions will be reviewed in light of performance before and after the switch.
Then there will be the accounting fallout, because all of this hoopla will be quantified and cost-assigned, and the next board meeting will be a live-fire event of its own.
No, comparing with a power cut doesn't even begin to do this kind of thing justice. It is a very poor comparison.
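Even the "monitoring that the switch has taken place" item hides real work. As a very rough sketch of the kind of probe involved (the endpoints and the follow-up actions are placeholders, not anybody's real product):

```python
# Bare-bones health probe for a primary/secondary cloud setup.
# The URLs and the notify step are placeholders for illustration only.
import urllib.request

PRIMARY = "https://crm.primary-cloud.example/health"
SECONDARY = "https://crm.secondary-cloud.example/health"

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Treat anything other than a clean HTTP 200 as 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

if not is_up(PRIMARY):
    if is_up(SECONDARY):
        # In reality: repoint DNS, warn users, log the event, and start
        # the clock on the "do we switch back?" decision discussed above.
        print("Primary down, secondary healthy: begin failover procedure")
    else:
        print("Both providers unreachable: escalate")
else:
    print("Primary healthy")
```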
Despite service providers pushing the reliability of their services, outages are a very likely reality for those using cloud services.
First, there is something called the law of large numbers. Massively parallel systems at state of the art computing centres run to hundreds of thousands to millions of microprocessor cores. Even more astronomical numbers are being discussed for data centers where the goal is capacity to do lots of jobs as opposed to raw throughput.
The presumption of solid state reliability can be seriously questioned.
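To put a rough number on the law-of-large-numbers point: even with very reliable individual parts, a large enough fleet is failing somewhere essentially all the time. A toy calculation, with the MTBF figure assumed purely for illustration rather than measured from any real part:

```python
# Toy illustration of the law of large numbers applied to hardware.
fleet_size = 1_000_000          # cores / components in the data centre
mtbf_hours = 1_000_000          # assumed mean time between failures

failures_per_hour = fleet_size / mtbf_hours
failures_per_day = failures_per_hour * 24

print(f"Expected failures per hour: {failures_per_hour:.1f}")   # ~1.0
print(f"Expected failures per day:  {failures_per_day:.1f}")    # ~24.0
```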
The state of the art has changed dramatically since the term “solid state reliability” became common. Transistor feature sizes and component densities have all changed radically. New materials have introduced new failure mechanisms. These have been well understood for years:
ITRS, "Critical Reliability Challenges for The International Technology Roadmap for Semiconductors (ITRS)": http://www.itrs.net/Links/2005itrs/Linked%20Files/2005Files/PIDS/4377atr.pdf
Since then, restrictions on hazardous substances have added a new failure mechanism. Among the unintended consequences of this initiative is the spontaneous formation of tin crystal “whiskers”, which eventually short to some other part of the circuit, causing failures.
Bottom line: state-of-the-art microprocessors run 24 x 7 are going to have a limited life. Credible speculation is that this could be as short as a few years. And nobody appears to be seriously thinking about the cost of end-of-life replacement.
The issue is not the probability that there will be a catastrophic meltdown of data centers. The problem is manageable with existing technology if cost to the customer is no object.
The critical issue is that a small handful of large companies are effectively moving to limit the average customer’s options to reliance on large IT services companies for all their information management needs.
And then, there's bandwidth . . . a subject for another post.
Large data centers cost hundreds of millions to billions to construct. At the moment the Cloud has to compete with local alternatives . . . which include my ability to buy, for a few hundred dollars, a hard drive with more terabytes of data than I can envision using.
This is going to make redundancy as a solution to reliability issues a tough challenge. I'm not at all sanguine that, at half a billion a pop, industry is going to build excess unused capacity.
Unless, of course, they can contrive to create a virtual monopoly and dependence where they can demand what the traffic will bear.
And then, there's the bandwidth . . .
There is no such thing as a free lunch. The notion of achieving reliability in a flexible cloud is all well and good. There are two problems . . . first, the use of a flexible cloud presumes the existence of redundant unused capacity. Second, it presumes the ability to transfer petabytes of data.
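To give "petabytes" some scale, a rough transfer-time calculation; the link speed is an assumption for illustration, and real links carry protocol overhead on top:

```python
# How long does it take to move a petabyte? Illustrative figures:
# a single, fully saturated 10 Gbit/s link with no protocol overhead.
data_bytes = 1 * 10**15             # 1 PB
link_bits_per_second = 10 * 10**9   # 10 Gbit/s

seconds = (data_bytes * 8) / link_bits_per_second
days = seconds / 86_400
print(f"{seconds:,.0f} seconds, i.e. about {days:.1f} days")   # ~9.3 days
```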
As a fellow commentard wisely noted, the telecom companies have a dog in this fight. Like the data centers, they are in business to make money. They cannot be expected to build large amounts of excess capacity. Unless, of course, they can charge for it.
Bottom Line: There are four major stakeholders in this issue: the folks building the large data centers; the telecom companies; the government (for whom the infrastructure is strategically vital); and the customer. All but the customer have a strong vested interest in forcing the customer to use and pay for Cloud Computing services.
Finally, about that bandwidth: Shannon's "law" is still alive and well. Many of us have had the experience of getting onto the Wifi connection at a hotel, only to watch the number of bars shrink as more guests arrive and log on, until eventually only the guests closest to the Wifi transmitter have the signal-to-noise ratio to get any quality of service.
Now imagine that on a global scale.
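For the curious, Shannon's formula, C = B log2(1 + S/N), makes the hotel-Wifi effect easy to quantify: capacity grows only with the logarithm of the signal-to-noise ratio, so when the SNR collapses, so does the usable rate. The channel width and SNR figures below are made up for illustration:

```python
# Shannon capacity C = B * log2(1 + S/N). Bandwidth and SNR values are
# invented to illustrate the hotel-Wifi effect, nothing more.
import math

bandwidth_hz = 20e6     # a 20 MHz Wi-Fi channel

for scenario, snr in [("few guests", 1000),
                      ("busy evening", 10),
                      ("far from the AP", 1)]:
    capacity_bps = bandwidth_hz * math.log2(1 + snr)
    print(f"{scenario:17s} SNR {snr:6.0f} -> {capacity_bps/1e6:6.1f} Mbit/s")

# SNR 1000 -> ~199 Mbit/s; SNR 10 -> ~69 Mbit/s; SNR 1 -> 20 Mbit/s.
```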
No need to be sarky. :)
Some people think all the tech in the cloud is redundant, therefore you don't need a DR site.
They don't always know that it's rather like using RAID5 instead of a backup.
The problem is the cloud doesn't scale cheaply. When you push the limits of tech, things get expensive. When you add a third party, things get expensive. When you need serious uptime, things get expensive. When you put all your eggs in one basket, outages become expensive.
A third party has no interest in the value of your application uptime. Therefore, the (cost of) tech used is only really going to be vaguely appropriate.
Many world citizens assume that large and recognizable corporations like Adobe will surely employ the best cyber security and reliability technologies available. This is certainly not so, since Adobe has no history or experience whatsoever in Internet networking, computer security, high availability or reliability, and therefore probably gives less priority to such matters, which is then automatically reflected in whatever reliability and security solutions are engaged.
Don't forget, Adobe is a retail graphics technology firm, nothing more, irrespective of their wealth. Examine their tens of dozens of Adobe Flash fixes just in the past two to three years.
The only people predicting a rapid take-up of the Cloud over the next 2 years are the vendors who want to give you the impression that Cloud is taking over the world - the truth is quite the opposite. The only winners with, for instance, Microsoft's Cloud products remain the vendor and the partners / resellers receiving greater incentives. Most businesses (small, medium and large) face increased costs with Cloud over the contract period (compared to perpetual volume licensing), and I would be amazed if Microsoft achieved even 20% Cloud revenue by 2016, given how slow Enterprise customers have been in taking up Office 365 and Azure to date. And you cannot blame businesses for being sceptical - outages and increased costs are just a couple of the issues to grapple with - would you want the NSA spying on your company's data?
"Cloud" services are good when they're things you don't need live 100% of the time. Like overnight backups. As long as it works 99% of the time, it's not a big deal.
But for anything you need instant access to at random times, the cloud is not it. Think about how many things can break between your keyboard and the cloud provider's hard disks. Add in the number of people who can cock up a config or damage equipment between you and the cloud provider, and the whole deal looks really stupid.