But everything's OK.
They had it all backed up to the cloud.
A power outage fried hardware within one of Amazon Web Services' data centers during America's Labor Day weekend, causing some customer data to be lost. When the power went out, and backup generators subsequently failed, some virtual server instances evaporated – and some cloud-hosted volumes were destroyed and had to be …
No. Customers who lost data on their EBS volumes did NOT have it backed up anywhere – including to the cloud. EBS is nothing more than very resilient block storage. Nowhere in its service description or SLA does it imply that you do not need to back it up. In fact, the service description plainly states you can expect to lose 1 or 2 volumes a year if you have 1000 volumes... so you should use EBS snapshots and back it up.
Customers that lost data had no backup of their EBS data.
Err, no. Customers who lost data most likely have backups, just not real-time ones. If you backed up at midnight and had a failure at midday, for example, there will be a portion of your data that is not backed up, hence you suffer data loss. It wasn't because you didn't have backups, though, just the timeliness of them.
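The midnight-backup example above is really a statement about recovery point objective (RPO): with periodic backups, everything written since the last backup is at risk. A minimal sketch of that arithmetic (function names are illustrative, not any real API):

```python
# Data at risk = time since the last completed backup. With a daily
# backup at midnight and a failure at midday, twelve hours of writes
# were never backed up anywhere -- cloud or otherwise.

def data_at_risk_hours(last_backup_hour: float, failure_hour: float) -> float:
    """Hours of writes lost if the volume dies at failure_hour and the
    most recent backup completed at last_backup_hour (same day)."""
    return failure_hour - last_backup_hour

# Backed up at midnight (00:00), volume lost at midday (12:00):
print(data_at_risk_hours(0.0, 12.0))  # → 12.0 hours of unprotected writes
```

Taking snapshots more often shrinks that window; it never eliminates it.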
I've heard so many times that you can migrate your on-premises servers to the cloud in a more or less 1:1 mapping and let your cloud provider do all the work of maintaining uptime and data integrity.
And yet again we have proof that you still have to put in the effort to ensure you have geographically diversified replication and backups.
Yes, you can migrate your on-premises systems to EC2 in a 1:1 mapping -- as long as one of those servers is the backup server. ;) (Or, of course, use the cloud equivalent of such.) These customers migrated but left backup out of the equation.
"Made to believe." Not by the product description or the SLA, that's for sure. The product description comes right out and says you will lose 1 or 2 volumes a year if you have 1000 volumes -- so back up. The SLA does not imply anything about backup.
So if they were "made to believe" that, they didn't get it from Amazon.
Gonna get some downvotes for this, but if you use EBS, that sort of failure is to be expected. Amazon say if you have 1000 EBS volumes running for a year, you should expect one or two to fail and have to be restored from backup. Those numbers are obviously averages across all AWS DCs, whilst problems tend to be concentrated at particular DCs.
If you put data in there that you absolutely must have restored, you should either use different storage or take snapshots as regularly as you need them. EBS is the equivalent of local disk storage; it's not cross-AZ. If you require proper resilience in the cloud, you should be using something like S3, and design your systems appropriately to be able to use that sort of storage.
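The "1 or 2 out of 1000 per year" figure quoted above works out to an annual failure rate of roughly 0.1–0.2% per volume. A back-of-envelope check, assuming independent failures (which, as the comment notes, they are not – problems cluster at particular DCs):

```python
# Rough arithmetic behind "expect 1 or 2 of 1000 EBS volumes to fail
# per year". AFR of 0.1% used here; treat the independence assumption
# as a simplification, since real failures are correlated within a DC.

def expected_failures(volumes: int, afr: float) -> float:
    """Expected number of volume losses per year."""
    return volumes * afr

def p_at_least_one_failure(volumes: int, afr: float) -> float:
    """Chance that at least one of your volumes dies this year."""
    return 1.0 - (1.0 - afr) ** volumes

print(expected_failures(1000, 0.001))                  # → 1.0 volume/year
print(round(p_at_least_one_failure(1000, 0.001), 3))   # → 0.632
```

In other words, at fleet scale a lost volume per year is close to a certainty, which is exactly why the snapshot advice above isn't optional.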
No Tom, just stop it!
This thread is for ill-informed rants about "the cloud is just someone else's computer". Someone who uses the cloud, understands exactly what this storage is supposed to provide and then points it out is gonna get short shrift.
Given that Cloud is sold to manglements on the basis that it takes away all those complications of dealing with their in-house expert staff and hands it over to people who'll just do the work without arguing, those rants seem fully justified.
It is somebody else's computer. When using your own computers you expect someone on your staff to look after them. If you've been persuaded to use somebody else's because it's cheaper you might reasonably expect that somebody else to do the looking after. Anything else smacks of keeping a dog and barking yourself.
"Manglements" should at least know how to read an SLA. That's what they are good at. Someone somewhere should mention that this doesn't include backup, and if they're doing their job then they would double-check that. And if they double-checked it, they would find out they are responsible for backup.
EBS is a very secret system that no one is allowed to understand. The only details AWS will provide about it are its general service description and uptime/redundancy figures, but you aren't allowed to know how it works or how it's redundant.
I can't understand why anyone in technology wouldn't want to understand how it works, but all these developers and PHBs seem to be fine with not knowing.
I had a days-long discussion with a developer at one point, explaining to them that I have never had a storage failure on a RAID system in the more than 20 years I have been doing this at many different orgs, some of them Fortune 500. When it finally dawned on him that it's not normal for businesses to incur data loss due to disk failure, he was shocked (because he was a developer and just thought about things related to what he knew, like his desktop computer).
No offence to any devs out there...
I also got the impression EBS was more durable than that. By "lost power" do they mean it got some nasty massive over-current that actually buggered the drives? I'm not going to check my notes for a comment (I actually care about durability of file systems, I know right, crazy)... I'm pretty sure I ticked EBS off in the "okay and will work" column (doesn't reorder writes past a barrier...). Don't quote me (I don't use it or support anything that does), but:
The system respects flush commands as barriers (fsync-like: later writes can move before earlier ones, but earlier ones can't escape past the flush – this is why applications wait) – so WTF?
By "pretty sure" I mean "assured that it happens" by what I've read and heard, even if what I'd like to see isn't in the docs.
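The flush-as-barrier behaviour described above is what applications rely on via fsync: the call blocks, and only writes issued before it are guaranteed durable, while anything written afterwards may still be lost on power failure. A minimal POSIX-style sketch:

```python
# Writes before the fsync() barrier must survive a power loss; the
# write after it is still in flight and may vanish. This is the
# ordering guarantee the comment above expects EBS to honour.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

os.write(fd, b"committed record\n")
os.fsync(fd)  # barrier: the line above is now on stable storage

os.write(fd, b"in-flight record\n")  # could be lost if power dies here
os.close(fd)
```

If the storage layer reorders writes across that barrier, journalling file systems and databases break – hence the "so WTF?" above.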
Also is it just dumb luck the snapshots were not buggered as well?
EBS is very resilient storage, but it's not fool-proof.
And, no, they were not lucky the snapshot survived. An EBS "snapshot" is actually an image stored as an object on S3, which is an entirely different system. Unlike EBS, S3 is replicated across three AZs, and the replication happens at an object level. EBS is only replicated within an AZ, and it is using block-level replication.
As I read the article, they had multiple failures, and it buggered multiple systems simultaneously. 99.5% were able to recover, and most of the 0.5% had EBS snapshots so they could recover as well. But some of those did not have snapshots, so they actually lost data.
Note: This does not pertain specifically to EBS and its snapshot system.
Sucks you'll probably never read this, but: kind of.
It's actually a tree. You make your base image, say. Snapshot it – that's gone from (empty drive) to (base image crap), so it contains ALL the data on the drive plus snapshot-format data describing it. It may be compressed, but you get the idea.
Okay, now you make some changes and snapshot again – that snapshot says "hey, I'm based on this other one", and any read requests will hit it first. If the changes affect those reads it returns data, never querying the base image; otherwise it passes the request to its parent (the base image here) – which may return zeros if the read is out of bounds.
This keeps going on. Yes, this /could/ use garbage collection, or just "fuck it, a new complete snapshot would be good" if you have layers and layers of reads.
However, changes made may be in use by other snapshots with their own active drives.
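The layered read path described above can be sketched as a toy copy-on-write chain: each snapshot stores only the blocks changed since its parent, and a read walks up the chain until some layer has the block, falling back to zeros at the base. Purely illustrative – the real EBS snapshot format is not public:

```python
# Toy model of a snapshot tree: child layers hold only deltas; reads
# fall through to the parent, and the base returns zeros for blocks
# never written. Class and field names are made up for illustration.
from typing import Dict, Optional

BLOCK = 4  # toy block size in bytes

class Snapshot:
    def __init__(self, parent: Optional["Snapshot"] = None):
        self.parent = parent
        self.blocks: Dict[int, bytes] = {}  # only this layer's changes

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = data

    def read(self, n: int) -> bytes:
        if n in self.blocks:           # this layer changed the block
            return self.blocks[n]
        if self.parent is not None:    # pass the request to the parent
            return self.parent.read(n)
        return b"\x00" * BLOCK         # base: out of bounds -> zeros

base = Snapshot()
base.write(0, b"boot")
snap1 = Snapshot(parent=base)  # "hey, I'm based on this other one"
snap1.write(1, b"conf")

print(snap1.read(0))  # → b'boot' (falls through to the base image)
print(snap1.read(1))  # → b'conf' (served from the newer layer)
print(snap1.read(9))  # → b'\x00\x00\x00\x00' (never written anywhere)
```

Deep chains make every read a walk up the tree, which is exactly why the "collapse it into a new complete snapshot" option mentioned above becomes attractive.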
As has been pointed out, they're stored on S3, which is (probably) much more durable and is designed with very different things in mind than backing a file system (in the generic sense).
This system works at the block level, where you (conceptually) ask a drive for block 123456 or say "write this to block 3471233" – the snapshot system need not have any idea WTF is going on. Indeed it might be an encrypted drive (probably should be, too).
The catch there is that encrypted shit is hard to compress. The hard-core version has no trivial base layer of 0s either. Not sure if they go that far.
It's an interesting problem because these systems are not just backups with deltas it's common to have a base image, customise it a tiny amount for a bunch of machines, and have them running, then patch the base image, smarter ones are file-system aware.
A block-based one may have metadata for a server-specific configuration file in the same block as metadata about a file changed by an update to the base image – they must cry bloody murder here. So typically systems that are not filesystem-aware (most, or at least the default behaviour of most) instead know how to generate the specific images on demand. For example, "image-specifier %id" might be run with the server's ID filled in wherever it sees %id, or something. You patch the base image and then it regenerates the snapshots from it.
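The "%id" approach mentioned above amounts to keeping a parameterised base image and regenerating each server's specific image on demand rather than diffing filesystem-unaware blocks. A minimal sketch; the template syntax and names are hypothetical:

```python
# Regenerating per-server images from a patched base template instead
# of carrying block-level deltas. "%id" placeholder is illustrative.
base_image_template = "hostname=web-%id\nrole=frontend\n"

def render_image(template: str, server_id: str) -> str:
    """Fill the server's ID in wherever the template says %id."""
    return template.replace("%id", server_id)

print(render_image(base_image_template, "042"))
# → hostname=web-042
#   role=frontend
```

Patch `base_image_template` once, then re-run `render_image` for every server: each specific image is rebuilt from the new base, with no metadata-collision problem.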
It's really quite interesting but it does split into a bunch of different problems.
...that's 'cause you get managers who say "I want to be infrastructure free. It will save us loads of money" and, despite being told it won't and that there is no such thing as infrastructure free, they want stuff done on the cheap. So they think moving to the Cloud means the service will do everything backup-wise, not understanding that you have to set that up yourself and pay for it.
Biting the hand that feeds IT © 1998–2020