Thank you DDR5
With ECC on by default, we can finally move on from this and put the problem behind us, where it belongs.
If the next version of the Linux kernel emerges a little slower than usual, blame a dodgy DIMM in Linus Torvalds's AMD Threadripper-powered PC and the vagaries of the memory market. In a post responding to a kernel developer inquiring if he had missed a Git Pull, Torvalds on Sunday revealed the request was still in his queue …
DDR5 only has on-die ECC; there is no ECC on the data while it is being sent to the CPU.
It's not ECC in the way the term is understood for actual ECC RAM.
Better than no ECC, of course, but not ECC as most people will understand it.
Luckily AMD seems to have no issue with ECC RAM, as long as your motherboard supports it. So you can use ECC on regular computing gear, if you pair a suitable motherboard with your AMD CPU.
Intel still thinks it's only used for servers and expensive workstations.
Good to hear, though not all AMD CPUs do it seems:
See the note "For Ryzen Series CPUs (Picasso and Raven Ridge), ECC is only supported with PRO CPUs." so while better than Intel, it is still a shame you have to go to such lengths to get workable ECC systems.
I haven't researched this, but the specific codenames imply that the chips that didn't support ECC were the Zen 1 and 2 range APUs. If that's true, then it's less that AMD also carved out a segment and more that they've only added it recently. It appears that all Zen 3 or higher APUs should support it, and for some reason Zen 1-2 chips without integrated graphics do as well. Maybe they had a problem getting ECC support alongside the integrated GPU when those were newer.
Before buying things based on my hypothesis, check more thoroughly than I did because I might be proving my ignorance.
Asrock does support ECC throughout with AM4. With AM5 they just took the ECC language off the website though it’s still in the English language manual pdf.
I’m still hoping they will support it and bring this back - and, they may have decided it wasn’t worth it as a differentiator.
Indeed, DDR5's on-chip ECC is not proper ECC. It exists only to ensure that, given how DDR5 works electrically, errors aren't introduced during bank swaps and other normal RAM operations, a consequence of the smaller processes these chips are produced on and the speeds they run at.
Only end-to-end checking of data between the CPU and data written to memory is ECC.
For the chap wondering which AMD boards have ECC support, here are the ones I know about:
Gigabyte Aorus X570 Pro, Asrock Rack X470D4U, Asrock Rack X570D4U.
As a data point, the ASRock B550M Pro4 also supports ECC.
Am using one as my desktop, with a Ryzen 5600X and 4x Kingston KSM26ED8/16HD 16GB RAM sticks.
Note that the RAM sticks in my system are running fine at 3200 MT/s, and:
Error Correction Type: Multi-bit ECC
Total Width: 72 bits
Data Width: 64 bits
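As a follow-on data point: on Linux you can confirm the ECC path is actually active (rather than just that the modules are 72 bits wide) via the kernel's EDAC error counters. A minimal sketch, assuming the standard `/sys/devices/system/edac/mc` sysfs interface; it needs a suitable EDAC driver loaded, so the function returns None where the interface is absent:

```python
from pathlib import Path

EDAC = Path("/sys/devices/system/edac/mc")

def edac_error_counts():
    """Return {controller: (corrected, uncorrected)} from the kernel's
    EDAC sysfs interface, or None if no EDAC driver is loaded."""
    if not EDAC.is_dir():
        return None
    counts = {}
    for mc in sorted(EDAC.glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text())  # corrected errors
        ue = int((mc / "ue_count").read_text())  # uncorrected errors
        counts[mc.name] = (ce, ue)
    return counts

counts = edac_error_counts()
if counts is None:
    print("No EDAC driver loaded; ECC error reporting unavailable")
else:
    for name, (ce, ue) in counts.items():
        print(f"{name}: {ce} corrected, {ue} uncorrected")
```

A steadily climbing corrected-error count is the early warning Linus didn't get.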
Worth noting that all recent AMD CPUs support ECC except Ryzen CPUs with built-in graphics (e.g. Ryzen 5600G). Which is slightly annoying, as otherwise they'd be a good choice for a home NAS.
(As I understand it, all the new Zen4 CPUs will have on board graphics, but will still support full ECC. DDR5 does also come in a full ECC version, as well as the normal stuff having 'sort-of' ECC).
So, anyone want to buy 16GB of DDR4 ECC? Never used (except for a memtest run).
>>Worth noting that all recent AMD CPUs support ECC except Ryzen CPUs with built in graphics (eg Ryzen 5600G).
My Ryzen 7 Pro 4750G supports ECC, that's why I chose it. Perhaps it's the "Pro" bit that makes it different, or maybe it's not quite as recent as you meant?
The motherboard is an ASUSTeK PRIME A520M-A, by the way.
This seems to show that there is a bit of a single point of failure in having one (albeit very dedicated and thorough) person with the ultimate say over something so important.
I hope that the kernel developers and Linus work to get a more resilient system without sacrificing the care with which the development is completed these days.
>>This seems to show that there is a bit of a single point of failure by having one... person to have the ultimate say over something so important
I called it a while ago.... The future is Poettering..... Us penguinistas better learn to love Microsoft and their star employee.
Once Linus is out of the picture, for whatever reason, watch, in raptures of delight, whilst Windows & Linux are morphed into one jack of all trades master of none OS. Systemd is just the start....
/S <------- just in case anyone is under any illusion
Lennart certainly has a lot of the same personality attributes that Linus has - which people either love or hate. When he was in Sydney in 2007 for linux.conf.au my son and I sat down with him for a lunchtime meat pie, if I recall correctly.
The buck eventually has to stop with SOMEONE. And hopefully someone competent in the matter.
Decision-by-committee is the worst death by a thousand cuts.
However, you do know you're teh awsum when your laptop develops a problem and it's worldwide news.
>>Decision-by-committee is the worst death by a thousand cuts
Sounds like you have been subjected to one too many "product planning" or "customer focus" meetings, as have I. Herding Jell-O cats is easier than getting a firm decision by a group of people with conflicting interests.
"This seems to show that there is a bit of a single point of failure by having one (albeit very dedicated and thorough) person to have the ultimate say over something so important."
It's been addressed. Over two decades ago, on segfault.org (February 23rd, 2000) ... and more recently (and seriously) every couple of years ever since. Seems that kids today don't learn history, and wouldn't believe it if they did, so it gets re-hashed ad nauseam.
Use the search engine of your choice and look up "What happens if Linus gets hit by a bus?".
The code's open. People will be compiling it all over the place in a matter of days. Would compiling and testing on a compromised server really do anything? I think not, but even if it would, just using a cloud machine doesn't mean there's anything wrong with it. The classic reason to distrust the cloud is the provider having access to the private data, but as this is open source code, there isn't private data to be had.
It is straightforward to implement an ECC memory system, provided the processor itself supports it; I have done plenty of them.
What needs to be done at motherboard level is to expose the ECC signals (for classic ECC) from the processor to the memory devices. ECC is just an extra memory device as the actual ECC (again, classic ECC) is done in the memory controller. Correct one, detect two is the most common form. The extra memory device stores the ECC syndrome bits.
It is only a few connections (10 or 11, depending on details and the version of the interface) and does not add significant cost to the design, but Intel et al. want people to think it is somehow magical and expensive - it isn't.
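For the curious, the "correct one, detect two" scheme described above can be sketched at toy scale. This is a hypothetical illustration, not any controller's real logic: real DDR ECC works on 64 data bits plus 8 check bits, while this shrinks it to 8 data bits plus 5, using an extended Hamming code with an overall parity bit:

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two positions
PARITY_POS = [1, 2, 4, 8]                # classic Hamming check positions

def encode(data: int) -> list[int]:
    """Encode 8 data bits as a 13-bit SECDED word:
    Hamming(12,8) plus an overall parity bit at index 0."""
    bits = [0] * 13
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # each check bit makes its covered positions even-parity
        bits[p] = sum(bits[i] for i in range(1, 13) if i & p) % 2
    bits[0] = sum(bits[1:]) % 2          # overall (even) parity
    return bits

def decode(bits: list[int]):
    """Return (data, status); status is 'ok', 'corrected' or 'double'."""
    bits = bits[:]                       # don't mutate the caller's word
    syndrome = 0
    for p in PARITY_POS:
        if sum(bits[i] for i in range(1, 13) if i & p) % 2:
            syndrome |= p                # syndrome = position of a flip
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                   # odd flip count: a single error
        bits[syndrome] ^= 1              # syndrome 0 means bit 0 itself
        status = "corrected"
    else:                                # even flips, non-zero syndrome
        status = "double"                # detected but uncorrectable
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status

word = encode(0b10110010)
word[6] ^= 1                             # simulate a single-bit fault
print(decode(word))                      # (178, 'corrected')
```

The extra memory device on a real DIMM simply stores the equivalent of those syndrome bits for each 64-bit word.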
There is a type of embedded ECC (I have seen it in SRAMs) where the checks are done within the device. The testing for this is often done at LANSCE as the majority of problems are due to free neutrons and there is a neutron beam facility to see just how robust the solution is.
Devices that start to fail within a short time (a few years) rarely do so due to age; more likely the device has heated up significantly, to the point that the refresh rate has to be increased (provided that feature has been enabled, which it probably is not).
Excessive heating of semiconductor devices shortens their life quite significantly - the rule of thumb is that the failure rate can double for each temperature rise of 10C (see the Arrhenius equation).
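To put rough numbers on that rule of thumb (the 2x-per-10C factor is the approximation above, not a device-specific figure):

```python
def relative_failure_rate(delta_t_c: float) -> float:
    """Rule-of-thumb Arrhenius scaling: failure rate roughly
    doubles for every 10 C rise in operating temperature."""
    return 2.0 ** (delta_t_c / 10.0)

# A DIMM running 20 C hotter fails roughly 4x as often...
print(relative_failure_rate(20))   # 4.0
# ...and 30 C hotter, roughly 8x as often.
print(relative_failure_rate(30))   # 8.0
```

So a poorly ventilated case can easily cost you most of a DIMM's expected lifetime.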
L1 parity and L2/L3/main memory ECC is required in flight safety critical avionics which is why Intel parts are not used in those applications.
So in this case an unremovable alert “memory chip broken” would have been all he needed.
Haven’t had this problem in 30 years. Before that, as a software developer: App crashed if the largest source file was compiled on one particular machine in the afternoon. After that: 4 out of 7 machines with third party RAM crashed quite reliably if you went for a coffee and stayed away for more than 15 minutes.
It really wasn't that long ago when ECC could be had on any old consumer board. I had a P3 system which worked happily with it, and I'm sure I had various K7's that would take them too.
Workstation pricing and organisations in money-no-object territory have created the rift that makes ECC "rare" for no really good reason.
His post also mentions that his main PC was set up for error correction code memory (ECC memory), but "during the early days of COVID when there wasn't any ECC memory available at any sane prices. And then I never got around to fixing it, until I had to detect errors the hard way."
From the article.
This is always a good idea for system RAM, and new computer builds in general. Helps get past the "early failure" and "overly heat sensitive" chips faster so you know you're more likely to have reliable memory.
ECC doesn't save you from bad RAM, it helps detect it, and occasionally can fix for long enough to complete tasks before having to replace the RAM.
But only if you use an Intel W680 workstation (I.e. expensive) chipset, e.g. in the HP Z2 Mini G9.
Making ECC a Xeon-only feature was a classic case of market segmentation by a monopolist to allow them to extract maximum profits from enterprise customers willing to pay more for reliability.
The problem is that this doesn't solve the segmentation they created at all.
All W680 motherboards I've found so far have a request a quote button instead of a price. This tells me "not for consumers" and "if you have to ask, it's probably too expensive".
The listings I can find peg the currently available ones at about $450. These boards definitely do not have feature parity with available consumer boards in this price range. This tells me Intel is still asking a crazy price for a chipset that is fundamentally the same as the consumer counterpart. The spec sheets list only vPro and ECC support as differences.
Consumers will be looking for Z690 boards, so manufacturers are unlikely to make W680 boards as it will directly influence their product discoverability.
Unless they just enable ECC support for consumer chipsets, it doesn't matter that they pat themselves on the back about this, because it's just inconsequential. There's still market segmentation, and in the end very few consumers will be able to use this feature.
I understand the price increase for Intel vPro, but for ECC it's just ridiculous.
I'm operating mostly from my elderly and not so great human memory here. But I think the background on ECC is rather complex. The original IBM PC (1981) memory had 9 bit bytes -- 8 data bits and a parity bit. So did its clones and all the subsequent PCs that became common in the late 1980s. And they really needed parity because memory wasn't all that reliable back then. And early PCs didn't use a lot of it. 640K -- all Intel could easily access -- was a lot in the early 1980s. After a while people found ways to use more. But not a lot more. Which was good because memory was expensive. I'm pretty sure that I recall commodity 1MB "DIMMs" (I think we called their predecessors something else back then) going for $100 a MB back in the mid 1990s.
In the 1990s it became obvious to "The Industry" that GUI was the future and that GUI OSes were going to need LOTS of memory. So they looked for corners to cut in order to keep PCs affordable. One of those corners was memory parity. Get rid of that ninth bit and we can get more data bits on a chip. And make a bit more profit. And maybe even make the products a bit cheaper for the consumer. And for those few who REALLY care about memory integrity we'll do something more sophisticated than parity. Hamming Codes. If you're curious, here's a link to an article I originally wrote for the Compuserve Hardware Forum about three decades ago. http://donaldkenney.x10.mx/GLOSSARY/ECCMEMOR.HTM. Back in the 1990s, many gave credit/blame for the disappearance of parity in consumer PC memory to Microsoft. Maybe they were right.
PC hardware and software in the 1990s often didn't work all that well even on good days. I personally think lousy non-parity memory was part of the problem. But there were many other issues. So we don't have parity in consumer PCs nowadays. And AFAICS, no one knows (or much cares) if that's a significant problem.
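To illustrate the ninth bit mentioned above: a single parity bit detects any one flipped bit but cannot locate it, and two flips cancel out entirely, which is exactly why "something more sophisticated" like a Hamming code is needed for correction. A toy sketch:

```python
def parity_bit(byte: int) -> int:
    """Even parity: the ninth bit makes the total number of 1s even."""
    return bin(byte).count("1") % 2

def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches its stored parity bit."""
    return parity_bit(byte) == stored_parity

b = 0b01101001
p = parity_bit(b)                     # stored as the ninth bit
assert check(b, p)                    # clean read passes
assert not check(b ^ 0b00010000, p)   # any single flipped bit is caught
assert check(b ^ 0b00010001, p)       # ...but two flips slip through
```

On the old 9-bit machines a failed check simply halted the PC with a "PARITY ERROR" message; there was nothing to correct with.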
> 640K -- all Intel could easily access -- was a lot in the early 1980s.
*cough* 'Intel' (i.e. 16-bit x86) can easily access 1MB of address space.
The decision to make only 640kb of that general-purpose RAM was IBM's; the rest is reserved for video card ROM and RAM, the BIOS ROM, etc.
If it was that early then it would likely have been EGA or even CGA video. There was something before that but it's lost to my memory, and probably just as well. Moving the ROM or RAM really depended on whether or not the card manufacturer supported it and if you had the patience to play the jumper game (I probably still have oodles of spare jumpers) and manage to get whatever cards were in a system somehow cooperating on IRQs and memory space.
From my understanding, ECC memory really becomes important/properly useful in consumer situations when the PC stays turned on (or in S3-or-shallower standby) for long periods of time, and with large amounts/densities of RAM (with a crossover somewhere between 16 and 32GB, especially with sticks of 16GB or more). Switching off/rebooting a PC regularly shields you from the brunt of random bit-flips (as long as your sticks are not having issues). Admittedly, always-on seems to be the norm currently (from my experience being the IT guy at my work, most people will let their computers suspend to RAM, or just leave them on continuously ¬¬).
Fully agree with you that dodgy RAM could've been a big part of software instabilities in the 90's (or at least, much bigger than given credit to), but I see it more as a QC/field failure of sticks rather than "this wouldn't have happened with ECC/parity".
PS: I will admit I used to S3-suspend my machine for a couple of years, but nvidia drivers on GNU/Linux, plus switching to an SSD, made me stop.
I was running servers 20 years ago with linux with 3 to 4G of ram and all of it ECC.
Unsure why the logic of running Linux on 4G back then would be different now. Regular ECC is barely adequate anymore. HP's Advanced ECC came out in the 90s and is far superior. IBM came up with Chipkill at a similar time (never used IBM servers myself). Dell never came up with anything of their own, but presented an "Advanced ECC" option in their Xeon 5500 systems, I remember. It was a different technology vs HP's. Dell's Advanced ECC came from Intel, and while it did the job it removed a third of the memory capacity of the system, if I recall right. HP's has no overhead by default (but can have overhead if you enable even greater protection from DIMM failures, including online spare memory and memory mirroring).
That said I just got my first laptop with ECC last week. Lenovo P15 with 8 core xeon and 80G of ram. Was kicking myself for not going ECC on my 2016 Lenovo P50 which I still use today, currently with 48G. Not that I've noticed any instability. It runs 24/7. Both laptops run linux mint 20.
As for Linus, I wonder why he didn't just remove the faulty DIMMs. HP, Dell, and perhaps others too can tell you which DIMM is bad. I remember the pain involved when I used Supermicro years ago, which could not do that; tracking down which one was bad was annoying. I'd assume he has at least a half dozen DIMMs and can lose 2 or 3 (maybe have to remove a full bank) and still have a more functional system than a laptop.
Actually, HP Advanced ECC didn't come out in the '90s but arrived with the introduction of the ProLiant Gen7 with DDR3 memory and Xeon 5500 processors. It has been mostly used as a way to lock customers into buying HP/HPE memory instead of 3rd party (the Adv ECC memory modules are not much different from regular modules aside from the HPE ID; the real difference in Advanced ECC is in the memory management on the mainboard).
IBM's Chipkill isn't much older.
Dell servers since the PowerEdge 11G don't rely on proprietary ECC modules; instead they simply hot-spare 1/3rd of the installed modules (which are plain standard ECC registered memory DIMMs), which jump in when a memory module fails during operation. Some PowerEdge servers also offer memory mirroring (i.e., RAID1 for memory). This is also why there is less usable memory.
I'm quite sure I had Advanced ECC on Proliant G3.
I quoted 90s because this HP doc says as much
"To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in
In prehistoric days memory came as DIL chips to plug into sockets on memory boards or cards; the next generation you are thinking of was the SIMM (single in-line memory module), whose contacts on the two sides were electrically tied together, so each pad position carried only one signal. The first sort were 30-pin, and had either 8 (no parity) or 9 (with parity) bits. IIRC many PCs wanted 9-bit but Macs could use 8-bit, so you could swap one way but not the other. These were usually seen in 256k, 512k or 1M (byte) varieties.

Next up were 72-pin SIMMs, with 32 (no parity) or 36 (parity) bits. The parity ones were used in fancy Unix workstations, but by this time PCs could slum it with non-parity versions and would mostly just ignore the extra bits. These got up to about 64M per SIMM, but 64M was the maximum memory for many PC motherboards and seemed very lavish then. DIMMs (dual in-line memory modules) took the revolutionary step of making the contacts on the two sides independent, to accommodate memory capacities going up faster even than UK house prices.
I lost track of ECC when DIMMs came along; I only know that I've got a little bag of ECC DDR DIMMs (512M and 1G) which never worked in any PC motherboard I tried. I suspect their previous owner had the same experience, so was not unhappy with the £1 he took for them!
What's this weird obsession Linuxeros have with MacBooks? I understand why they'd want an Intel MacBook (Linux-friendly hardware with open-source drivers), but the new Apple Silicon MacBooks are actually hostile to Linux and only support macOS. Is it some kind of status symbol I don't get?
>>but the new Apple Silicon Macbooks are actually hostile to Linux and only support MacOS.
Meanwhile, back in reality, the Asahi Linux FAQ states:
Apple allows booting unsigned/custom kernels on Apple Silicon Macs without a jailbreak! This isn’t a hack or an omission, but an actual feature that Apple built into these devices. That means that, unlike iOS devices, Apple does not intend to lock down what OS you can use on Macs (though they probably won’t help with the development).
So I guess we're defining "actually hostile" to mean "won't help with the development", and "only support MacOS" to mean "Apple allows booting unsigned/custom kernels"?
Elsewhere from the FAQ, to get ahead of the topic:
Apple still controls the boot process and, for example, the firmware that runs on the Secure Enclave Processor. However, no modern device is “fully open” - no usable computer exists today with completely open software and hardware ...mainstream x86 platforms are arguably more intrusive because the proprietary UEFI firmware is allowed to steal the main CPU from the OS at any time via SMM interrupts, which is not the case on Apple Silicon Macs. This has real performance/stability implications; it’s not just a philosophical issue.
At home, I run a dual-socket MB with ECC FB-DIMMs. My home machine is probably 10 years old now and I'm seeing serious issues. My most recent DIMMs were bought used off eBay and I think one of them is going bad. Logically everyone should be using memory that can correct errors, but for me, I seriously wonder if things like ECC, Fully Buffered, dual sockets and power-loss-safe SSDs are worth the extra cost. I've seen plenty of people with SSDs that don't have power-loss safety, and people without ECC memory, who seem to have more reliable, cheaper and faster computers than I have. I'm still on mechanical HDs at home BTW.
One last thing: I know this thread is about GNU/Linux, but I wish Microsoft would get rid of that stupid TPM requirement for Windows 11. I'm pretty sure TPMs don't increase security as much as most users believe they do. I'm pretty sure many electrical engineers can figure out easy ways to hack TPMs.
Spec-wise, this home machine is massively multi-threaded and should perform really well, but what I seem to be seeing is that many programmers still don't properly take advantage of parallelism as much as they could. For this reason, it is often better to get a cheaper computer with fewer cores and really fast single-threaded performance (i.e. a high but stable clock rate with high MIPS).