Chalk one up...
... to cosmic rays.... :-/
Hosting outfit Gandi has published its postmortem regarding this month's outage and concluded that while it still has "no clear explanation", the main problem was "the duration". So that's OK then. The mystery incident took down a storage unit in the company's Luxembourg facility at 14:51 UTC on 8 January. It wasn't until 13 …
those pointing smugly at their own storage and hosting setups would do well to take a careful look at Gandi's experience.
Very well said. Hardware and software mostly do a very good job of abstracting and hiding the complexity behind the scenes, but who among us can honestly say they've never gone down a diagnostic rabbit hole and quickly come up against vast swathes of components, libs, interfaces, interactions, config options and mechanisms, such that Google can only show you how much you don't know? Who here has complete docs for every aspect of the systems or apps they're responsible for? (Yes, that includes you, devs!)
Oh come now, it's far worse than that..
The joys of using Java + Spring via Maven and then... Start adding some JavaScript library/framework du jour with some Node build tool which routes through a Redis instance or 5 and then slap it into Docker and run from AWS.
Always feels like each and every part of the whole house of cards is usually on fire if not smoldering at the very least... Fun.
Welcome to the modern world of serious websites with a hell of a legacy behind them.
Anonymous because I'm not getting drawn into just how bad this can get.
IIRC dedupe carries a certain risk of hash "collisions". Most systems do (but I assume the maths is arranged so that a collision is near impossible for any data a user/computer would actually want to store, leaving it to things like 100% zeros, or the Star Wars prequel trilogy).
ZFS is probably the same?
ZFS allows you to just check the hashes or do a full block verification. If you are using SHA-256 the chances of a hash collision are very very very small. According to the author, a collision is 50 orders of magnitude less likely than an undetected & uncorrected ECC memory error.
https://blogs.oracle.com/bonwick/zfs-deduplication-v2
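For anyone who wants to see the numbers, here is a rough Python sketch of the two ideas above: the birthday-bound estimate for a 256-bit hash, and the difference between trusting the hash alone and doing a full block comparison on a match. This is purely illustrative and is not how ZFS implements dedup; `DedupStore`, `collision_probability` and the in-memory dict are made-up names for the example.

```python
import hashlib
from fractions import Fraction


def collision_probability(num_blocks: int, hash_bits: int = 256) -> Fraction:
    """Birthday-bound approximation: p ~= n * (n - 1) / 2^(bits + 1).

    Only valid while the result is far below 1, which it is for any
    realistic number of blocks and a 256-bit hash.
    """
    return Fraction(num_blocks * (num_blocks - 1), 2 ** (hash_bits + 1))


class DedupStore:
    """Toy in-memory block store keyed by SHA-256 digest (hypothetical)."""

    def __init__(self, verify: bool = False):
        self.verify = verify   # True: byte-compare blocks when hashes match
        self.blocks = {}       # hex digest -> block contents

    def write(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        existing = self.blocks.get(digest)
        if existing is None:
            # First time this digest is seen: store the block.
            self.blocks[digest] = data
        elif self.verify and existing != data:
            # Same digest, different bytes: a genuine collision.
            # Without verify this would silently return the wrong block.
            raise ValueError("hash collision detected")
        return digest


if __name__ == "__main__":
    store = DedupStore(verify=True)
    first = store.write(b"x" * 4096)
    second = store.write(b"x" * 4096)       # duplicate, stored only once
    print(first == second, len(store.blocks))

    # 2**35 unique blocks is an enormous pool; the estimate is still tiny.
    print(float(collision_probability(2 ** 35)))
```

With 2^35 unique blocks the estimate comes out on the order of 10^-57, which is why hash-only dedup is generally considered safe, with the full comparison available for the properly paranoid at the cost of an extra read.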
I'd more likely suspect multiple disk failures after a power outage forcing recovery from backup when the procedure hasn't been tested recently.