Re: Is it luck the snapshot survived?
Note: This does not pertain specifically to EBS and its snapshot system.
Sucks that you'll probably never read this, but: kind of.
It's actually a tree. Say you make your base image and snapshot it. That snapshot goes from (empty drive) to (base image crap), so it contains ALL the data on the drive plus the snapshot format data describing it. It may be compressed, but you get the idea.
Okay, now you make some changes and snapshot again. That snapshot says "hey, I'm based on this other one", and any read request hits it first. If the changes cover the blocks being read, it returns the data without ever querying the base image; otherwise it passes the read to its parent (the base image here), which may in turn return zeros from its own base (the empty drive) if the block is outside anything that was ever written.
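To make the read path concrete, here's a minimal sketch of that chain in Python. Everything in it (the Snapshot class, the 4096-byte block size, the dict of blocks) is made up for illustration and is not how EBS or any real product stores things; it only shows "check this layer, else ask the parent, else zeros".

```python
from typing import Dict, Optional

BLOCK_SIZE = 4096               # assumed block size, purely illustrative
ZERO_BLOCK = bytes(BLOCK_SIZE)  # what the implicit "empty drive" returns


class Snapshot:
    """One layer: holds only the blocks written since its parent was taken."""

    def __init__(self, parent: Optional["Snapshot"] = None) -> None:
        self.parent = parent
        self.blocks: Dict[int, bytes] = {}

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        # Walk from the newest layer towards the base image.
        layer: Optional[Snapshot] = self
        while layer is not None:
            if block_no in layer.blocks:
                return layer.blocks[block_no]
            layer = layer.parent
        # Nobody in the chain ever wrote this block: empty drive, so zeros.
        return ZERO_BLOCK


base = Snapshot()
base.write(0, b"base image".ljust(BLOCK_SIZE, b"\0"))

child = Snapshot(parent=base)
child.write(1, b"my change".ljust(BLOCK_SIZE, b"\0"))
```

Reading block 0 through child falls through to base, block 1 is answered by child itself, and any block nobody wrote comes back as zeros.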
This keeps going. Yes, this /could/ use garbage collection, or just a "fuck it, a new complete snapshot would be good" if reads have to fall through layers and layers of snapshots.
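However, the changes in a given layer may be in use by other snapshots with their own active drives, so you can't just merge or throw away a layer without checking what else hangs off it.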
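A rough sketch of both ideas, again in Python with invented names (Layer, flatten, can_merge_into_child): flattening a chain into one complete snapshot, and the check that stops you from folding away a layer that other snapshots still depend on. Real systems will track this very differently; this is only the shape of the problem.

```python
from typing import Dict, List, Optional


class Layer:
    """A snapshot layer that also remembers which layers are based on it."""

    def __init__(self, parent: Optional["Layer"] = None) -> None:
        self.parent = parent
        self.blocks: Dict[int, bytes] = {}
        self.children: List["Layer"] = []
        if parent is not None:
            parent.children.append(self)


def flatten(layer: Layer) -> Dict[int, bytes]:
    """Collapse a whole chain into one complete block map."""
    chain = []
    node: Optional[Layer] = layer
    while node is not None:
        chain.append(node)
        node = node.parent
    merged: Dict[int, bytes] = {}
    for node in reversed(chain):  # oldest first, so newer writes win
        merged.update(node.blocks)
    return merged


def can_merge_into_child(layer: Layer) -> bool:
    """Folding a layer into its child is only safe if it has exactly one child."""
    return len(layer.children) == 1


base = Layer()
base.blocks = {0: b"old", 1: b"base"}
a = Layer(base)
a.blocks = {0: b"new"}
b = Layer(base)  # a second snapshot that also depends on base
c = Layer(a)     # the only snapshot that depends on a
```

Here flatten(a) gives a standalone copy with block 0 from a and block 1 from base, while base itself can't be folded into a because b still reads through it.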
As has been pointed out, they're stored on S3, which is (probably) much more durable and is designed with very different things in mind than backing a file system (in the generic sense).
This system works at the block level, where you (conceptually) ask a drive for block 123456 or say "write this to block 3471233". The snapshot system doesn't need to have any idea how to actually understand WTF is going on in those blocks. Indeed, it might be a snapshot of an encrypted drive (and it probably should be).
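The catch there is that encrypted shit is hard to compress. The hardcore version, where the whole drive is encrypted, has no trivial base layer of 0s either, presumably because every block, used or not, looks like random data. Not sure if they go that far.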
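You can see the compression problem with a couple of lines of Python. This uses os.urandom as a stand-in for ciphertext (an assumption: good encryption output is meant to be indistinguishable from random bytes) and zlib as a stand-in for whatever compression a snapshot format might use.

```python
import os
import zlib

BLOCK_SIZE = 4096  # assumed block size, purely illustrative

zero_block = bytes(BLOCK_SIZE)         # an untouched block on a plain drive
random_block = os.urandom(BLOCK_SIZE)  # stand-in for an encrypted block

zero_compressed = len(zlib.compress(zero_block))
random_compressed = len(zlib.compress(random_block))

print("zeros: ", BLOCK_SIZE, "->", zero_compressed, "bytes")
print("random:", BLOCK_SIZE, "->", random_compressed, "bytes")
```

The zero block shrinks to a handful of bytes, while the random one stays at roughly its original size (or slightly bigger once the compression header is added).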
It's an interesting problem because these systems are not just backups with deltas. It's common to have a base image, customise it a tiny amount for a bunch of machines, have them all running, and then patch the base image. The smarter ones are file-system aware.
A block-based one may have metadata for a server-specific configuration file in the same block as metadata for a file changed by an update to the base image, and it has to cry bloody murder when that happens. So systems that are not filesystem aware (most, or at least the default behavior of most) typically know how to generate the specific images on demand instead. For example, "image-specificier %id" might be run with the server's ID filled in wherever it sees %id, or something like that. You patch the base image and then it regenerates the snapshots from it.
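As a toy version of that regenerate-on-demand idea, here's a Python sketch. The specialise and regenerate functions, the file paths and the contents are all hypothetical; the only thing taken from above is the %id placeholder. A real tool would be building actual disk images, not dicts of strings.

```python
from typing import Dict, List


def specialise(template: str, server_id: str) -> str:
    """Fill in the server's ID wherever the template says %id."""
    return template.replace("%id", server_id)


def regenerate(base_files: Dict[str, str],
               server_ids: List[str]) -> Dict[str, Dict[str, str]]:
    """Build one per-server image (path -> contents) from the base image."""
    return {
        sid: {path: specialise(text, sid) for path, text in base_files.items()}
        for sid in server_ids
    }


# Hypothetical base image with one templated file and one shared file.
base_image = {
    "/etc/hostname": "web-%id\n",
    "/etc/motd": "patched base v1\n",
}
images = regenerate(base_image, ["1", "2"])

# Patch the base image, then regenerate instead of merging block deltas.
base_image["/etc/motd"] = "patched base v2\n"
images = regenerate(base_image, ["1", "2"])
```

The point is that the per-server differences never live in a delta that can collide with a base-image patch; they get re-derived from the patched base every time.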
It's really quite interesting but it does split into a bunch of different problems.