AWS celebrates Labor Day weekend by roasting customer data in US-East-1 BBQ

A power outage fried hardware within one of Amazon Web Services' data centers during America's Labor Day weekend, causing some customer data to be lost. When the power went out, and backup generators subsequently failed, some virtual server instances evaporated – and some cloud-hosted volumes were destroyed and had to be …

  1. jake Silver badge

    But everything's OK.

    They had it all backed up to the cloud.

    Right? RIGHT???

    1. Nolveys

      Re: But everything's OK.

      The "lost" data are still in the cloud. It's just that the cloud is black, made of smoke and is floating around a data centre somewhere.

    2. lglethal Silver badge
      Trollface

      Re: But everything's OK.

      But but...CLOUD!! *mumble mumble* Something something redundancy... something something backup... something something disaster recovery.... CLOUD!!

    3. wcpreston

      Re: But everything's OK.

      No. Customers who lost data on their EBS volumes did NOT have it backed up anywhere – including the cloud. EBS is nothing more than very resilient block storage. Nowhere in its service description or SLA does it imply that you do not need to back it up. In fact, the service description blatantly states you can expect to lose 1 or 2 volumes a year if you have 1000 volumes. ... so you should use EBS snapshots and back it up.

      Customers that lost data had no backup of their EBS data.

      1. Mark 65

        Re: But everything's OK.

        Err, no. Customers who lost data most likely have backups, just not real-time ones. If you backed up at midnight and had a failure at midday, for example, there will be a portion of your data that is not backed up, hence you suffer data loss. It wasn't because you didn't have backups, though, just because of the timeliness of them.
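The timeliness gap described above is what backup planning calls the recovery point objective (RPO). A minimal sketch (hypothetical function name, not any AWS API):

```python
from datetime import datetime, timedelta

def data_at_risk_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written between the last backup and the failure is lost."""
    return failure - last_backup

# Midnight backup, midday failure: twelve hours of writes were never protected.
gap = data_at_risk_window(datetime(2019, 9, 1, 0, 0), datetime(2019, 9, 1, 12, 0))
```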

        1. Anonymous Coward
          Anonymous Coward

          Re: But everything's OK.

          That is not accurate, there was real data loss: https://www.bleepingcomputer.com/news/technology/amazon-aws-outage-shows-data-in-the-cloud-is-not-always-safe/

  2. Ian Michael Gumby
    Boffin

    Outch!

    Hate to be the guy who has to tell his non-technical boss that going to the cloud just cost them a bunch of data that they can't recover.

    1. jake Silver badge

      Re: Outch!

      Or perhaps he could say "I told you so!". Has worked for me several times, albeit as a consultant, not an employee. It's sweet. Very sweet.

  3. Anonymous Coward
    Anonymous Coward

    What's Corey going to do?

    Can't wait for @quinnypig to awkwardly hyperventilate "Well, at least 99.5% survived".

  4. Anonymous Coward
    Anonymous Coward

    Idiots!

    We all know that this kind of thing can only happen if you use the cloud.

    1. Anonymous Coward
      Anonymous Coward

      Re: Idiots!

      Yes, because no one's ever had a disk subsystem fried by a power issue in the history of the world before cloud.

      1. EH

        Re: Idiots!

        r/woosh ?

  5. Paul

    Using the cloud doesn't absolve you of the need to design your platform

    I've heard so many times that you can migrate your on-premises servers to the cloud in a more or less 1:1 mapping and let your cloud provider do all the work of maintaining uptime and data integrity.

    And yet again we have proof that you still have to put in the effort to ensure you have geographically diversified replication and backups.

    1. wcpreston

      Re: Using the cloud doesn't absolve you of the need to design your platform

      Yes, you can migrate your on-premises systems to EC2 in a 1:1 mapping -- as long as one of those servers is the backup server. ;) (Or, of course, use the cloud equivalent of such.) These customers migrated but left backup out of the equation.

      1. Olivier2553 Silver badge

        Re: Using the cloud doesn't absolve you of the need to design your platform

        Or maybe they migrated because they had no backup to start with and were made to believe that now they would not need to even consider having any.

        1. wcpreston

          Re: Using the cloud doesn't absolve you of the need to design your platform

          "Made to believe." Not by the product description or the SLA, that's for sure. The product description comes right out and says you will lose 1 or 2 volumes a year if you have 1000 volumes -- so backup. The SLA does not imply anything about backup.

          So if they were "made to believe" that, they didn't get it from Amazon.

  6. JacobZ
    Joke

    Convenience of the cloud

    In the old days we all had to be constantly alert for the possibility of power loss, network outages, server failures, and other physical disasters.

    Nowadays you can pay a Cloud vendor to provide them for you.

  7. Tom 38 Silver badge

    Gonna get some downvotes for this, but if you use EBS, that sort of failure is to be expected. Amazon say if you have 1000 EBS volumes running for a year, you should expect one or two to fail and have to be restored from backup. Those numbers are obviously averages across all AWS DCs, whilst problems tend to be concentrated at particular DCs.

    If you put data in there that you absolutely must be able to restore, you should either use different storage or take snapshots as regularly as you need them to be. EBS is the equivalent of local disk storage; it's not cross-AZ. If you require proper resilience in the cloud, you should be using something like S3, and design your systems appropriately to be able to use that sort of storage.
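Amazon's quoted failure rate (1 to 2 volumes per 1000 per year) is easy to sanity-check. A quick sketch, assuming failures are independent:

```python
def expected_failures(volumes: int, annual_failure_rate: float) -> float:
    """Average number of volumes lost per year."""
    return volumes * annual_failure_rate

def prob_at_least_one(volumes: int, annual_failure_rate: float) -> float:
    """P(at least one loss) = 1 - (1 - p)^n, assuming independent failures."""
    return 1 - (1 - annual_failure_rate) ** volumes

# At the top of the quoted range (0.2% per volume per year), a 1000-volume
# fleet should expect about two losses a year, and is ~86% likely to lose
# at least one -- hence "take snapshots".
```

Of course, as the comment notes, real failures cluster at particular DCs, so the independence assumption is the weak point.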

    1. sabroni Silver badge

      re: if you use EBS, that sort of failure is to be expected.

      No Tom, just stop it!

      This thread is for ill-informed rants about "the cloud is just someone else's computer". Someone who uses the cloud, understands exactly what this storage is supposed to provide, and then points it out is gonna get short shrift.

      1. Doctor Syntax Silver badge

        Re: re: if you use EBS, that sort of failure is to be expected.

        Given that Cloud is sold to manglements on the basis that it takes away all those complications of dealing with their in-house expert staff, and hands the work over to people who'll just do it without arguing, those rants seem fully justified.

        It is somebody else's computer. When using your own computers you expect someone on your staff to look after them. If you've been persuaded to use somebody else's because it's cheaper you might reasonably expect that somebody else to do the looking after. Anything else smacks of keeping a dog and barking yourself.

        1. wcpreston

          Re: re: if you use EBS, that sort of failure is to be expected.

          "Manglements" should at least know how to read an SLA. That's what they are good at. Someone somewhere should mention that this doesn't include backup, and if they're doing their job then they would double-check that. And if they double-checked it, they would find out they are responsible for backup.

          1. Anonymous Coward
            Anonymous Coward

            Re: re: if you use EBS, that sort of failure is to be expected.

            Manglements *should* know how to do a lot of things.

            Reality differs.

          2. DJV Silver badge

            Re: re: if you use EBS, that sort of failure is to be expected.

            ""Manglements" should at least know how to read"

            Stopping right there explains many of the problems.

    2. Blane Bramble

      This sounds like EBS failed and wasn't even locally redundant though. That is a much bigger problem if true.

      1. Mr.Nobody

        EBS is a very secret system that no one is allowed to understand. The only details AWS will provide are its general service description and uptime/redundancy figures, but you aren't allowed to know how it works or how it's redundant.

        I can't understand why anyone in technology wouldn't want to understand how it works, but all these developers and PHBs seem to be fine with not knowing.

        I had a days-long discussion with a developer at one point, explaining that I have never had a storage failure on a RAID system in the more than 20 years I have been doing this at many different orgs, some of them Fortune 500. When it finally dawned on him that it's not normal for businesses to incur data loss due to disk failure, he was shocked (because he was a developer and just thought about things related to what he knew, like his desktop computer).

        No offence to any devs out there...

  8. Doctor Syntax Silver badge

    It gives "Availability Zone" a whole new layer of meaning.

  9. Anonymous Coward
    Anonymous Coward

    EC2 backup and replication are easy and cheap. They have been around for more than 3 years. Nakivo makes a great EC2 B&R tool.

    A customer, not a Nakivo employee.

  10. ATeal

    Is it luck the snapshot survived?

    I also got the impression EBS was more durable than that. By "lost power" do they mean it got some nasty massive over-current that actually buggered the drives? I'm not going to check my notes for a comment (I actually care about durability of file systems, I know, right? Crazy) ... I'm pretty sure I ticked EBS off in the "okay and will work" column (not reordering writes past a barrier...). Don't quote me (I don't use it or support anything that does), but:

    The system respects flush commands as barriers (fsync-like, so later writes can move before earlier ones, but earlier ones can't escape the flush; this is why applications wait) - so WTF?

    By "pretty sure" I mean "assured that it happens" by what I've read and heard, even if what I'd like to see isn't in the docs.

    Also is it just dumb luck the snapshots were not buggered as well?
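The flush-barrier behaviour described above is the ordinary fsync pattern; a generic sketch (plain file I/O, not EBS-specific):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write and then force the data to stable storage.

    fsync acts as the barrier the comment describes: later writes may be
    reordered among themselves, but nothing written before the fsync can
    "escape" past it -- which is why applications block on the call.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit the data to the device
```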

    1. wcpreston

      Re: Is it luck the snapshot survived?

      EBS is very resilient storage, but it's not fool-proof.

      And, no, they were not lucky the snapshot survived. An EBS "snapshot" is actually an image stored as an object on S3, which is an entirely different system. Unlike EBS, S3 is replicated across three AZs, and the replication happens at the object level. EBS is only replicated within an AZ, using block-level replication.

      As I read the article, they had multiple failures that buggered multiple systems simultaneously. 99.5% were able to recover, and most of the remaining 0.5% had EBS snapshots, so they could recover as well. But some of those did not have snapshots, so they actually lost data.

      1. highdiver_2000

        Re: Is it luck the snapshot survived?

        I have always thought a snapshot is a delta from a full backup.

        1. GreenReaper

          Re: Is it luck the snapshot survived?

          No, it is the whole thing at a point in time - however, if you have multiple snapshots, you can calculate the difference between them, back that up, and restore it on a server that has the original snapshot; then it'll have both.
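That difference-and-restore idea can be sketched at the block level (toy block maps, nothing like the real EBS snapshot format):

```python
def snapshot_delta(old: dict, new: dict) -> dict:
    """Blocks changed or added in `new`; None marks a block that was removed."""
    delta = {b: d for b, d in new.items() if old.get(b) != d}
    delta.update({b: None for b in old if b not in new})
    return delta

def apply_delta(base: dict, delta: dict) -> dict:
    """Replay a delta onto a copy of `base`, reproducing the newer snapshot."""
    restored = dict(base)
    for block, data in delta.items():
        if data is None:
            restored.pop(block, None)
        else:
            restored[block] = data
    return restored
```

Shipping only the delta is why incremental snapshots are cheap: unchanged blocks never leave the source.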

        2. ATeal

          Re: Is it luck the snapshot survived?

          Note: This does not pertain specifically to EBS and its snapshot system.

          ---------------------------

          Sucks that you'll probably never read this, but: kind of.

          It's actually a tree. You make your base image, say. Snapshot it - that's gone from (empty drive) to (base image crap), so it contains ALL the data on the drive plus snapshot-format data describing it. It may be compressed, but you get the idea.

          Okay, now you make some changes and snapshot again - that snapshot says "hey, I'm based on this other one", and any read requests will hit it first. If the changes affect those reads, it returns data, never querying the base image; otherwise it passes the request to its parent (the base image here) - which may return zeros if the read is out of bounds.

          This keeps going on. Yes this /could/ use garbage collection or just "fuck it, a new complete snapshot would be good" if you have layers and layers of reads.

          However, the changes made may be in use by other snapshots with their own active drives.

          As has been pointed out, they're stored on S3, which is (probably) much more durable and is designed with very different things in mind than backing a file system (in the generic sense).

          This system works at the block level, where you (conceptually) ask a drive for block 123456 or say "write this to block 3471233" - the snapshot system need not have any idea WTF is going on inside. Indeed it might be a snapshot of an encrypted drive (it probably should be, too).

          The catch there is that encrypted shit is hard to compress. The hard-core version has no trivial base layer of zeros either. Not sure if they go that far.

          It's an interesting problem because these systems are not just backups with deltas: it's common to have a base image, customise it a tiny amount for a bunch of machines, and have them running, then patch the base image. Smarter ones are file-system aware.

          A block-based one may have metadata for a server-specific configuration file in the same block as metadata for a file changed by an update to the base image; they must cry bloody murder here. So typically, systems that are not filesystem-aware (most, or at least the default behaviour of most) instead know how to generate the specific images on demand. For example, "image-specifier %id" might be run with the server's ID filled in wherever it sees %id, or something. You patch the base image and it regenerates the snapshots from it.

          It's really quite interesting but it does split into a bunch of different problems.
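The read-through-to-parent chain described above can be sketched as a toy copy-on-write stack (an illustration, not AWS's implementation):

```python
class SnapshotLayer:
    """One layer in a copy-on-write chain: reads hit this layer first,
    then fall through to the parent; never-written blocks read as zeros."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}  # block number -> data written in this layer

    def write(self, block: int, data: bytes) -> None:
        self.blocks[block] = data

    def read(self, block: int) -> bytes:
        if block in self.blocks:
            return self.blocks[block]       # changed in this layer
        if self.parent is not None:
            return self.parent.read(block)  # pass the request up the chain
        return b"\x00"                      # base layer: unwritten reads as zeros
```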

  11. IGnatius T Foobar ! Bronze badge

    Get out

    Just one more example of why one company shouldn't have so much of the IT world in its own data centers. Get out of Amazon and find a smaller cloud provider. Diversity is what keeps the Internet reliable. Too much in one place, and you have this kind of problem.

  12. steviebuk Silver badge

    Well then...

    ...that's because you get managers who say "I want to be infrastructure-free. It will save us loads of money", and despite being told it won't, and that there is no such thing as infrastructure-free, they want stuff done on the cheap. So they think moving to the Cloud means the service will do everything backup-wise for them, not understanding you have to set that up yourself and pay for it.

    1. Anonymous Coward
      Anonymous Coward

      Re: Well then...

      Are you sure you didn't mean to type "[they] want to be serverless, it will save [...] lots of money"?


Biting the hand that feeds IT © 1998–2020