Don't forget Joyent!
Joyent suffered a very similar outage in January, and that one lasted four days.
No icon because... well... my website's down and I don't feel like it.
Another large cloud is on the fritz. Following last month's much-discussed Amazon S3 outage, most (if not all) of XCalibre's FlexiScale cloud went dark on Tuesday, and nearly two days later, the UK-based hosting outfit has yet to restore service. According to XCalibre CEO Tony Lucas, the outage has affected "a vast majority" …
I feel for the poor bastard that hosed that data.
# vxdiskadm
....
"Hmmm, this command's taking a bit long to return... Ah, cool, never mind. We're back."
<phone rings>
"Hello. File not found, eh? I'll take a look."
<some minutes later>
"Uhh guys. What's the difference between LUN0PRD and LUN0DEV?"
<cold sweat>
<P45>
*right click* Lun 30
*delete Lun*
are you sure? yes
*wait*
~~~ ring ring
"Hi - support"
"Hi - we seem to have a problem accessing site a"
"Okay wait a second... hmm odd, just wait a moment... Hi could I call you back"
*looks at Navisphere window - notes that it is not Lun 30 that has just been vaped, it is in fact Lun 31*
*talk to backup guy - backup guy goes lulz nobody back dat shiz up*
you sigh.
In a badly organised architecture it's a piece of piss to fuck it all up.
Of course, if dev/sys were on a completely different system, with different access credentials and different boilerplates, and the live LUNs/disks were clearly labelled and kept well apart, then sure, you shouldn't be able to mess up. But my bet is that it all looked more or less the same, with the same passwords and the same layout, and the guy thought he was in A but was in B, and it all went to hell.
This is why live and non-live are never, ever, in the same place, and why you should never have both live and non-live systems open at the same time (unless absolutely necessary).
Also why you should never leave yourself logged into a live system any longer than is absolutely, positively necessary.
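A rough sketch of the sort of thing I mean, assuming bash and a hostname convention that marks the live boxes (the *prd*/*live* patterns are my assumption, not anything FlexiScale actually runs):

# in /root/.bashrc on every box -- loud prompt and idle logout on live kit only
case "$(hostname)" in
  *prd*|*live*)
    PS1='\[\e[41;97m\][LIVE] \u@\h:\w# \[\e[0m\]'   # red, impossible-to-miss prompt
    TMOUT=600                                        # kick idle root shells after 10 minutes
    ;;
  *)
    PS1='[dev] \u@\h:\w# '
    ;;
esac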
"exactly HOW does said "XCalibre engineer accidentally delete one of FlexiScale's main storage volumes."?"
How exactly does a customer come to rely on a storage facility that doesn't have adequate back-up, without making sure it has adequate back-up of its own?
You just sign up with the first cloud that forms overhead, or what?
Thing is, you don't necessarily get a pink slip (everyone f--ks up, it's a fact of life, just like death and taxes), but a company should have things in place so that if (when) someone does f--k up, you can recover.
Why didn't they have replicas, why weren't they running on a decent bit of storage kit, why didn't they have some kind of backup? Why didn't they have a plan for "Oh s--t, I f---ed up!"? Why wasn't their storage divided up so that if you screw up C you don't also screw up A to B and D to Z?
Because of all the disasters that can happen, someone making a mistake is the most likely.
There are a vast number of ways that I could accidentally f--k up: a cp -rf where I miss a 1 off a directory number, an rm -rf when I'm in the wrong directory, a restore clone job on the wrong drive, a shutdown immediate in the wrong database, a shutdown -r now on the wrong server, a bog-standard operation that on server A means nothing but on server B means omfg, someone accidentally unplugging the wrong network cable.
The thing is, if I or anyone else in my team does that, we have ample ways to restore, recover and minimise any downtime. In 7 years of trading such things have only happened 4 times: once accidentally copying over a development database we had just finished, the second accidentally restarting the live database, the third accidentally rm -rf'ing the live webapp directory, and the last accidentally dislodging a network cable.
All four were quickly repaired, because at the end of the day we're ready for stupid f--- ups. As everyone should be.
People make stupid mistakes sometimes, but if they (or their teams) have planned around that fact then no mistake should be critical. Hell, I could accidentally take out our live database server and we'd be back up within an hour. I could take out our fibre and we'd be back up in 4 hours (we can't afford two SANs!!).
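Cheap guard rails help too. A rough sketch of what I mean (the /srv/live path and the wrapper itself are my own assumptions, nothing FlexiScale runs), dropped into the shared shell profile so the classic wrong-directory rm -rf needs a deliberate extra step:

# make interactive rm stop and ask before touching anything under the (assumed) live tree
rm() {
    for arg in "$@"; do
        case "$(readlink -f -- "$arg" 2>/dev/null)" in
            /srv/live/*)
                read -r -p "That path is under /srv/live - type LIVE to continue: " answer
                [ "$answer" = "LIVE" ] || { echo "rm aborted." >&2; return 1; }
                ;;
        esac
    done
    command rm "$@"
}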
Clouds are fluffy and nice from a distance, but if you get close up you just end up all wet. Much like those who use "cloud" computing today. Frankly, it's daft to rely on a historically unstable connection to the internet to access mission critical data being managed by people who won't be held financially responsible for any errors.
Until not only the companies providing the service but also the executives of these organizations can be held personally responsible for any and all loss of business and extra costs resulting from errors on the part of the "cloud" provider, I won't be recommending any of them to my clients. Just too damn risky.
Bill, because it's his fault that companies get away with not being responsible for shit.
The fact that the man in charge actually stood up and said "mea culpa" impresses me; most outfits I've dealt with would deny all knowledge of a problem, and when it was finally nailed to their doorstep would then promptly blame "a freak once in a lifetime string of hardware failures".
As for redundant systems, yes, I fully agree they should have had one in place; but if you're all bidding to be the cheapest supplier I wouldn't be surprised to find that they didn't... That's precisely why outsourcing's usually cheaper, on paper ;)
If I was a customer and this happened, an email with actual details of the problem, telling me the truth and taking responsibility, would give me a lot more faith in a company.
To people who are asking about this guy getting sacked, he made a mistake and backup procedures mean they are restoring the data to another volume.
It's 4 days because I expect it is a huge amount of data to restore.
I can think of other companies who might have just said tough, your data's gone.
These guys made a mistake and they are doing exactly the right thing in response.
Cloud computing is new; f**k-ups happen. Give it time to mature and I am sure it could be good.
...you takes your chances!
OK, you'd hope and expect good backup policies (and it sounds like a full restore from backups is what they're doing), but unless you're willing to pay top dollar don't expect redundant equipment, redundant storage etc. These things don't come cheap, especially when you add up all the extra costs over and above the capital investment, i.e. maintenance, support, power, cooling etc.
That's why certain companies aim at the Enterprise end of the spectrum and others aim at the Mom'n'Pop end. It's hard to be all things to all people! You either cost too much for Mom'n'Pop or don't have the redundancy and support for Mr Enterprise.
Big up to Tony Lucas, though, for standing up like a man and taking it on the chin! All too rare in today's modern world.
Ha. Ha. Ha.
We had a customer leave us recently and move to this, citing that we couldn't possibly offer a better service (and we had had a couple of issues, mainly due to a particular DC provider, who is no longer on our roster!).
Makes my day, because we told them this but naturally we were ignored.
(I agree with the honesty of Tony Lucas though - better to tell the truth.)
Sigh: outsource, privatise. It's all about saving money; effective delivery gets forgotten about. So hospital wards are dirty because ward managers have no authority over the privately employed cleaners, etc, etc. You always get what you pay for, so why is it always such a surprise?
This country will continue to head downwards unless and until this lesson is learnt. Icon because we are still going to crash and burn.
Actually, complexity, organisation etc. have nothing to do with it, nor does experience. There is a thing known as 'perceived risk', which means that in the best-run outfits everyone assumes that nothing can go wrong and everything can be recovered from. But Murphy says no: there is no system so bulletproof that mis-reading a digit cannot fuck it up entirely. On systems where everyone knows there are no safeguards, people are inevitably much more careful.
It has its place, and of course it will be obsolete for mass data storage soon, but hey, there is always a need to pop data out there, run the odd test application, deploy more computing power for events, etc.
The thing I don't think the IT sector gets is that as soon as fibre is in place, all the servers are coming home, never mind the cloud. You have to be seriously demented not to want your computer systems as close as possible to their centre of operations; it is much easier to fix your own computer system than to have to do it over a phone with some bod in a data centre.
When developing software my main concern is the distant server; that is the weak link in any system. Currently it is not too bad if you can build redundancy into the system, and the clouds can help with that as well, but more often another dedicated server is a better option than cloud space, primarily because you can configure it all better and security is tighter.
And as to fewer jobs because of the cloud, yeah right, they will try, but actually they are offering something more complex than racks of servers, and it will require more manpower to keep it all running. So I expect the market to contract sharply for the cloud suppliers; we may only have 3 or 4 major players with that concept. And most of those will be hunting the events accounts, the ad hoc "ohhhh, we need a load of computing power, but only for a fairly small time frame" sort of affairs.
And dedicated servers will always be around as well. It makes sense to host in various countries and use your local servers as fail-safes or traffic directors; that really takes the edge off having to worry about the distant server, since it is just being used as a convenience while adding stability into the mix. So come on, let's get the price of fibre down - last I looked it was 68 pence a metre for OM1, which is not too bad.
"...XCalibre will soon distribute its architecture across multiple data centers. "So, if something like this were to happen again, customers could fall over to an other data centers," Lucas says. But a second data center isn't due to open until January.
Yes, Lucas says, it would have been nice to have a second data center in place back in October, but, well, funds were tight..."
Maybe I'm being dumb here, but I thought the whole point of 'cloud computing' was that your files were spread over multiple data centres anyway - ensuring that if one went 'tits up' your data could still be retrieved from the others.
Without that element of built-in redundancy, I fail to see the difference between cloud computing and just renting some FTP space off any old common-or-garden web host, as we've all been doing up to now.
Buzzwords, eh? Dontcha love 'em!
You can follow the Twit-ter at: http://twitter.com/tonylucas
It's full of positives and you'd think the restoration is all going smoothly. Only it's not.
Very frustrated customer, who's *still* waiting on access 2 days after the supposed 'magic bullet' of new storage arrived.
Paris, because clueless deserves company.
Anonymous Coward,
If you would like to drop me an e-mail (or a note via Twitter) I'll gladly give you an update specifically regarding your servers.
The vast majority of customers are back up and running without problems, but we are still sorting through the rest as fast as possible, and yes, some will still be coming online tomorrow, I would imagine.
Accidents like this happen. For example, a sysadmin wants to make a change to a SQL Server but accidentally selects 300 production servers. Despite two warnings, which appear whether one or multiple servers are selected, the sysadmin pushes the change through. A week later the business returns to service, and the sysadmin is fired in the process.
Businesses like this really need to establish a thorough risk management plan with contingencies, which should include "two keys required to fire" for any configuration management change. Unfortunately, technical tools do not lend themselves to a hierarchical command approval process... everything ultimately depends on a single person being wide awake, focused and 100% dedicated to the task, and that same person is God in the system. The former is an unrealistic work environment, yet the latter occurs all the time in IT.
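The "two keys" bit doesn't need fancy tooling either. A rough sketch of the idea (the twokey.sh name, the /var/approvals directory and the whole mechanism are my own illustration, not a feature of any vendor's kit): a destructive change only runs if a second admin, logged in as a different user, has signed it off first.

#!/bin/sh
# twokey.sh <change-id> <command...> -- refuse to run until someone else approves
CHANGE_ID="$1"; shift
APPROVAL="/var/approvals/$CHANGE_ID"      # assumed admin-only directory
if [ ! -f "$APPROVAL" ]; then
    echo "No approval on file for $CHANGE_ID - get a second admin to sign off." >&2
    exit 1
fi
# GNU stat: %U is the file owner; the approver must not be the person running the change
if [ "$(stat -c %U "$APPROVAL")" = "$(id -un)" ]; then
    echo "Approval for $CHANGE_ID was created by you - a different admin must sign it." >&2
    exit 1
fi
echo "Approved by $(stat -c %U "$APPROVAL"); running: $*"
exec "$@"

The second admin just touches /var/approvals/CHG-1234 from their own account, and then the person making the change runs twokey.sh CHG-1234 followed by the destructive command. Crude, but it forces a second pair of eyes onto anything that can vaporise a LUN.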