Re: @msobkow
To me, ECC alone hasn't been enough for real servers for a long time. I remember reading this more than a decade ago regarding HP's "Advanced ECC":
http://service1.pcconnection.com/PDF/AdvMemoryProtection.pdf
The document is so old that it references generation 2 servers, which I was deploying back in 2004 (2005 at the latest), I think.
From the PDF:
"To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in 1996. HP and most other server manufacturers continue to use this solution in industry-standard products. Advanced ECC can correct a multi-bit error that occurs within one DRAM chip; thus, it can correct a complete DRAM chip failure. In Advanced ECC with 4-bit (x4) memory devices, each chip contributes four bits of data to the data word. The four bits from each chip are distributed across four ECC devices (one bit per ECC device), so that an error in one chip could produce up to four separate single-bit errors."
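To make the interleaving in that quote concrete, here is a toy Python sketch. It is not HP's implementation: I am assuming a tiny Hamming(7,4) single-error-correcting code as a stand-in for the real ECC words, seven hypothetical x4 chips instead of a real DIMM layout, and made-up function names. The point is only that when each chip contributes one bit to each of four separate ECC words, a whole-chip failure shows up as at most one bad bit per word, which a single-error-correcting code can repair.

```python
# Toy model only: Hamming(7,4) stands in for the real ECC code, and
# seven x4 "chips" stand in for a real DIMM. All names are hypothetical.

def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Correct at most one flipped bit and return the repaired codeword."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # 0 means no error, else 1-based position
    if pos:
        c[pos - 1] ^= 1
    return c

def store(nibbles):
    """Spread four codewords over seven chips: chip i holds bit i of each."""
    codewords = [hamming74_encode(n) for n in nibbles]
    return [[cw[i] for cw in codewords] for i in range(7)]

def kill_chip(chips, idx):
    """Worst case for one chip: all four of its bits come back flipped."""
    chips = [list(c) for c in chips]
    chips[idx] = [b ^ 1 for b in chips[idx]]
    return chips

def load(chips):
    """Reassemble the four codewords, correct each, and extract the data."""
    codewords = [[chips[i][k] for i in range(7)] for k in range(4)]
    fixed = [hamming74_correct(cw) for cw in codewords]
    return [[cw[2], cw[4], cw[5], cw[6]] for cw in fixed]

data = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
survives = all(load(kill_chip(store(data), i)) == data for i in range(7))
```

In this toy, `survives` ends up true for every chip index, because each dead chip corrupts exactly one bit in each of the four codewords. If the four bits from one chip all landed in the same codeword instead, a single-error-correcting code could not recover from it.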
I've always wondered how well Advanced ECC does against these attacks. I have read that ECC alone is enough to defeat them as they stand today, but I have not seen whether Advanced ECC has any further benefit beyond regular ECC in this security scenario.
IBM has/had a similar technology called ChipKill:
https://en.wikipedia.org/wiki/Chipkill
(update)
I came across a PDF from HP linked in the article above:
http://ftp.ext.hp.com//pub/c-products/servers/options/Memory-Config-Recommendations-for-Intel-Xeon-5500-Series-Servers-Rev1.pdf
It puts things in plainer English:
"Note that Advanced ECC is equivalent to 4-bit ChipKill. Lockstep gets us to 8-bit ChipKill. ChipKill just indicates that an entire DRAM chip can die and the server will keep running.
Negatives of Lock Step Mode:
- You have to leave one of the three memory channels on each processor un-populated, so you cut your available number of DIMM slots by 1/3.
- Performance is measurably slower than normal Advanced ECC mode.
- You can only isolate uncorrectable memory errors to a pair of DIMMs (instead of down to a single DIMM)."
I do remember turning on "Advanced ECC" in a Dell server (I was happy to see the option appear in the BIOS; this was back in 2010, I think). However, I was sad to see that it disabled a bunch of the DIMM slots, I assume for fault tolerance. HP has a similar option called something like "Online spare memory", where some banks are kept in reserve (on my 384GB systems it lowered addressable memory to 320GB). I don't have any information on Dell's implementation: whether it was just online spare memory that they called Advanced ECC, or some other approach. Perhaps they have improved it a lot in the past decade. (update) I am guessing Dell's "Advanced ECC" was Intel Lockstep.
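For a rough sense of the capacity cost, here is a small Python sketch of the arithmetic. The three channels per processor and the "leave one channel empty" rule come from the HP quote above; the two-processor, three-DIMMs-per-channel layout is purely my assumption for illustration, and the function names are made up.

```python
from fractions import Fraction

def lockstep_slots(processors, channels_per_cpu=3, dimms_per_channel=3):
    """Return (total, usable) DIMM slots when lockstep leaves one
    channel per processor unpopulated. The layout is an assumed example."""
    total = processors * channels_per_cpu * dimms_per_channel
    usable = processors * (channels_per_cpu - 1) * dimms_per_channel
    return total, usable

def reserved_fraction(installed_gb, addressable_gb):
    """Fraction of installed memory held back, e.g. by online spare."""
    return Fraction(installed_gb - addressable_gb, installed_gb)

total, usable = lockstep_slots(processors=2)   # assumed 2-socket box
spare = reserved_fraction(384, 320)            # the figures from my systems
```

With that assumed layout, lockstep leaves 12 of 18 slots usable (the one-third cut from the quote), while the 384GB to 320GB drop I saw with online spare works out to 64GB, or one sixth of installed memory, held in reserve.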
I have been quite surprised that others haven't come up with similar technology (thinking Supermicro and other smaller players). Or perhaps they have and I'm just not aware of it.