Re: Bugs, eh?
> FORMAL METHODS. Allied to CORRECT ARCHITECTURE
Or at least use provably correct algorithms, like the Raft protocol for replication of state.
It's amazing how many vendors stick two head ends on a system and say "this is redundant", on the basis that if A fails then B is there to take over. But every possible scenario of node A and node B running slow or going offline and coming back online *will* eventually happen; and usually the result is split-brain data corruption.
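To make the two-head-end problem concrete, here is a toy sketch (not any vendor's actual failover logic; the rule names and the partition model are made up for illustration). It counts how many sides of a network partition believe they are allowed to accept writes under a naive "my peer is gone, I'll take over" rule versus a strict-majority rule of the kind Raft relies on.

```python
def naive_rule(reachable: int, cluster_size: int) -> bool:
    # Hypothetical "redundant pair" logic: any node that is still alive
    # assumes the nodes it cannot see are dead and carries on as primary.
    return reachable >= 1


def majority_rule(reachable: int, cluster_size: int) -> bool:
    # Quorum logic: only a side that can reach a strict majority of the
    # cluster (counting itself) may accept writes.
    return reachable > cluster_size // 2


def writable_sides(partition: list[int], rule) -> int:
    """Count the sides of a partition that would accept writes.

    `partition` lists how many nodes ended up on each side,
    e.g. [1, 1] is a two-node pair with the link between them cut.
    """
    cluster_size = sum(partition)
    return sum(1 for side in partition if rule(side, cluster_size))


# Two-node pair, link cut: both heads go primary, which is split-brain.
split_pair_naive = writable_sides([1, 1], naive_rule)

# Same pair under a majority rule: nobody has quorum, so nobody writes.
split_pair_majority = writable_sides([1, 1], majority_rule)

# Three nodes split 2 | 1: exactly one side keeps going.
split_trio_majority = writable_sides([2, 1], majority_rule)
```

The point of the sketch is that with only two nodes you get to choose between split-brain and stopping; you need a third vote (a witness or a third replica) before a majority rule can keep one side safely alive.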
Here they were replicating a block store. Fine, except a filesystem is a data structure which extends over multiple blocks; if you have replicated some blocks but not others at the time you cut over, your data structure is toast. And even if the filesystem itself is in a consistent state, the individual files which your application reads and writes may be left in an inconsistent state, which makes them useless if the application has to restart from that point (think: VM image files, database files).
So you need all your applications *and* your OS to generate consistent checkpoints, and replicate only at a checkpoint state. Alternatively, snapshot the entire running system state including RAM (which contains the application state and VFS block cache) and replicate that, which requires integrating the VM and storage layers.
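A toy illustration of the difference, assuming a pretend two-block "filesystem" whose only invariant is that a header block records how many items the data block holds (the block names and the invariant are invented for the example):

```python
import copy


def consistent(disk: dict) -> bool:
    # Invariant spanning two blocks: the header must match the data.
    return disk["header"] == len(disk["data"])


def replicate_blocks(src: dict, dst: dict, block_names: list[str]) -> None:
    # Block-level replication: copy individual blocks, one at a time,
    # with no knowledge of which blocks belong together.
    for name in block_names:
        dst[name] = copy.deepcopy(src[name])


# Primary and replica start in the same consistent state.
primary = {"header": 2, "data": ["a", "b"]}
replica = copy.deepcopy(primary)

# One logical update on the primary touches both blocks.
primary["data"] = ["a", "b", "c"]
primary["header"] = 3

# Block replication interrupted by a cut-over: only the header made it.
torn = copy.deepcopy(replica)
replicate_blocks(primary, torn, ["header"])

# Checkpoint replication: the replica only ever swaps in a whole
# snapshot taken at a point where the invariant held.
checkpointed = copy.deepcopy(primary)
```

Here `torn` claims three items but holds two, so `consistent(torn)` is false, while both the stale `replica` and the `checkpointed` copy pass the check. A stale but consistent replica loses recent work; a torn one loses the data structure.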
This is much harder than people think; and of course if you get it wrong everything appears to be working tickety-boo, until the day it fails and you really needed it to be right.
If Salesforce are using Oracle under the hood, I'd expect them to switch from block replication to database-level replication (a standby built with RMAN and kept current by shipping redo, i.e. Data Guard). Each database transaction is either replicated, or it isn't.
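The "either replicated, or it isn't" property can be sketched as a standby that buffers changes per transaction and applies them only when the commit record arrives. This is a toy model with a made-up log record format, not how Oracle actually represents or applies redo.

```python
def apply_committed(log: list[tuple]) -> dict:
    """Replay a shipped log on a standby, applying whole transactions only.

    Hypothetical record shapes:
      (txid, "set", key, value)  a change made inside transaction txid
      (txid, "commit")           transaction txid committed
    """
    state: dict = {}
    pending: dict = {}  # txid -> changes seen so far, not yet committed
    for record in log:
        txid, kind = record[0], record[1]
        if kind == "set":
            _, _, key, value = record
            pending.setdefault(txid, []).append((key, value))
        elif kind == "commit":
            # Only now do the buffered changes become visible, all at once.
            for key, value in pending.pop(txid, []):
                state[key] = value
    # Anything still in `pending` never committed and is discarded.
    return state


# The link dies partway through transaction t2.
shipped = [
    ("t1", "set", "balance_a", 50),
    ("t1", "set", "balance_b", 150),
    ("t1", "commit"),
    ("t2", "set", "balance_a", 0),
]
standby_state = apply_committed(shipped)
```

After the replay, `standby_state` contains both of t1's changes and nothing from t2, so the standby is behind the primary but never halfway through a transaction.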