"behave in ways that are nothing like their physical datacenter counterparts"
Chat platform Discord delivered a playful slap to Google yesterday with a post describing how the company dealt with "reliability issues" to achieve impressively low latency. Discord handles 4 billion messages per day from its millions of users. The company runs a set of NoSQL database clusters …
Things are not what they seem.
Not sure where the downvotes came from. Makes sense to me.
Quite a few service desk calls are about things users are unable to do that turn out to be matters of configuration or product understanding, meaning they are "fixed" without any underlying change.
Only last week, I worked on a "problem" that arose from a subtle misunderstanding about what a display was showing. Once it was understood, no further action was needed. So an issue, not a problem.
The most important thing was to record the incident so that successors and other colleagues understand it too.
Back in the day, when 4GB of memory was about £4k, we would RAID 10 our data on live services, because a good controller would allow writing to one disk of a mirror pair while reading from the other, and because reads don't take any locks they would be extremely fast. This was more expensive than RAID 5 (or 3 or 4) but much more resilient: it allowed for different batches of disks in a mirror pair and was an order of magnitude faster at replacing a failed disk.
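To make the read-scheduling point concrete, here's a toy Python sketch of one mirror pair. The names and structure are mine, purely for illustration; real controllers do this in firmware. Writes only complete once both copies are updated, while a read can be served by whichever disk is free.

```python
import itertools

class MirrorPair:
    """Toy model of one RAID 10 mirror pair (illustrative only)."""

    def __init__(self, size):
        self.disks = [bytearray(size), bytearray(size)]
        self._reader = itertools.cycle([0, 1])  # round-robin read scheduling

    def write(self, offset, data):
        # A write only completes once BOTH copies are updated,
        # which is why mirrored writes cost more than reads.
        for disk in self.disks:
            disk[offset:offset + len(data)] = data

    def read(self, offset, length):
        # Either copy is valid, so alternate between them; while one
        # disk is busy servicing a write, reads can come off the other,
        # and no locking is needed because reads don't modify anything.
        disk = self.disks[next(self._reader)]
        return bytes(disk[offset:offset + length])

pair = MirrorPair(1024)
pair.write(0, b"live data")
assert pair.read(0, 9) == b"live data"
```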
Well, the article isn't particularly clear, but I suspect the servers were on GCP, so the SAN options are limited to what Google offered.
It does seem the team at Discord learned a hard lesson in the fundamentals of designing for transactional performance, with solution design made more difficult by the use of cloud.
Because they auto-remap the bad sectors behind your back.
Now the remapped sector needs an additional lookup, adding latency.
You can run out of the spare sectors that they hide from you as well.
This is one reason that disk wipe software had to develop special methods for dealing with SSDs: wiping all *active* sectors doesn't wipe *all* data off the disk.
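As an aside, remapped sectors often betray themselves as read-latency outliers long before anything fails outright. A rough Python sketch of how you might look for them; the device path and sample count are placeholders, and a serious probe would use O_DIRECT to bypass the page cache, which this skips:

```python
import os
import random
import statistics
import time

PATH = "/dev/sda"   # placeholder: a block device (needs root) or any large file
SAMPLES = 1000
BLOCK = 4096        # read size in bytes

fd = os.open(PATH, os.O_RDONLY)
size = os.lseek(fd, 0, os.SEEK_END)   # device/file size in bytes

latencies = []
for _ in range(SAMPLES):
    # Pick a random block-aligned offset and time a single small read.
    offset = random.randrange(0, size - BLOCK, BLOCK)
    start = time.perf_counter()
    os.pread(fd, BLOCK, offset)
    latencies.append(time.perf_counter() - start)
os.close(fd)

latencies.sort()
print(f"median: {statistics.median(latencies) * 1e6:.0f} us")
print(f"p99:    {latencies[int(0.99 * len(latencies))] * 1e6:.0f} us")
```

A fat gap between the median and the p99 on a lightly loaded drive is the sort of thing remapping produces.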
Not particularly relevant to the tech stack in question, but remapping around bad sectors is a very old problem, and an SSD is no different. NTFS, for instance, has had a feature since the NT 3.5 era where, if there is a write error, it will mark the sector as bad in its internal tables, remap the LBA to another free block, and perform the write again. Doesn't help for read errors, obviously... but it's one of those little things where, if you start to see it happening in the event log, you've got a disk on borrowed time. Then came SMART, which basically does the same thing at the hardware level; again, if you notice these events, your disk is on borrowed time. These behaviors are great... right until they aren't, since it's easy to ignore or never see the events that warn you of the pending failure.
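For anyone who wants to act on that advice, here's a minimal Python sketch that shells out to smartctl (from smartmontools; my assumption, the comment above doesn't name a tool) and flags the classic remapping attributes:

```python
import subprocess

DEVICE = "/dev/sda"  # placeholder device
# Classic SATA SMART attributes that point at remapping trouble.
# (NVMe drives report differently, e.g. "Available Spare", and smartctl
# prints those in another format, which this sketch doesn't handle.)
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable")

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    fields = line.split()
    if len(fields) > 1 and fields[1] in WATCH:
        # In smartctl's attribute table the last column is the raw value;
        # anything above zero here deserves attention.
        print(fields[1], "raw value:", fields[-1])
```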
Every HDD that's failed on me has done so with warning enough - slowdowns and noisiness - that I've managed to back up and migrate to a new drive in time.
Meanwhile, every SSD that's failed on me has done so catastrophically. Though in one case it was just the boot sector of the SSD, and that was bad enough.