Linus Torvalds's faulty memory (RAM, not wetware) slows kernel development

If the next version of the Linux kernel emerges a little slower than usual, blame a dodgy DIMM in Linus Torvalds's AMD Threadripper-powered PC and the vagaries of the memory market. In a post responding to a kernel developer inquiring if he had missed a Git Pull, Torvalds on Sunday revealed the request was still in his queue …

  1. Sampler

    Thank you DDR5

    Being ECC by default means we can move on from this and finally put it behind where it deserves.

    1. Piro

      Re: Thank you DDR5

      There's lots of error checking on-DIMM, but unless you pay for the special ECC ones, it won't report jack to the OS, as far as I understand it. Gatekeeping is still in effect.

      1. Spazturtle Silver badge

        Re: Thank you DDR5

        The on-die ECC just corrects single-bit errors that occur during the refresh cycles; this is needed because DDR5 is run at such high frequencies that it is unstable by default.

  2. BOFH in Training

    DDR5 only has on-die ECC, and there is no ECC when the data is being sent to the CPU.

    It's not ECC in the way it is understood for actual ECC ram.

    Better than no ECC of course, but not ECC as most people will understand it.

    https://en.wikipedia.org/wiki/DDR5_SDRAM

    Luckily AMD seemed to have no issues if you use ECC ram, as long as your motherboard supports it. So you can use ECC on regular computing gear, if you get a suitable motherboard with your AMD CPU.

    Intel still thinks it's only used for servers and expensive workstations.

    1. Paul Crawford Silver badge

      Sadly finding an AMD motherboard that supports ECC is very difficult as few say so, and most say 'non-ECC' on lists so just searching for 'ECC' is pointless.

      1. phuzz Silver badge

        Basically all ASRock boards support ECC

        1. Paul Crawford Silver badge

          Good to hear, though not all AMD CPUs do, it seems:

          https://www.asrock.com/MB/AMD/X370%20Taichi/index.asp#Specification

          See the note "For Ryzen Series CPUs (Picasso and Raven Ridge), ECC is only supported with PRO CPUs." so while better than Intel, it is still a shame you have to go to such lengths to get workable ECC systems.

          1. doublelayer Silver badge

            I haven't researched this, but the specific codenames imply that the chips that didn't support ECC were the Zen 1 and 2 range APUs. If that's true, then it's less that AMD also carved out a set but rather that they've only added it recently. It appears that all Zen 3 or higher APUs should support it and for some reason, Zen 1-2 chips without integrated graphics do as well. Maybe they had a problem getting ECC support along with the integrated GPU when those were newer.

            Before buying things based on my hypothesis, check more thoroughly than I did because I might be proving my ignorance.

          2. Spazturtle Silver badge

            "ECC is only supported with PRO CPUs"

            That means supported in the official sense, as in if it doesn't work you can get support for it under the warranty, but ECC works on the normal non-PRO Ryzens too.

        2. Yorick

          They did with AM4

          Asrock does support ECC throughout with AM4. With AM5 they just took the ECC language off the website though it’s still in the English language manual pdf.

          I’m still hoping they will support it and bring this back - and, they may have decided it wasn’t worth it as a differentiator.

    2. devin3782

      Indeed, DDR5 on-chip ECC is not proper ECC. It's just there to make sure that, due to how DDR5 works electrically, an error isn't introduced during bank swaps and other normal RAM operations, given the smaller processes these chips are produced on and the speed they run at.

      Only end-to-end checking of data between the CPU and data written to memory is ECC.

      For the chap wondering which AMD boards have ECC support here's the ones I know about:

      Gigabyte Aorus X570 Pro, Asrock Rack X470D4U, Asrock Rack X570D4U.

      1. Justin Clift

        Another ECC board

        As a data point, the ASRock B550M Pro4 also supports ECC.

        Am using one as my desktop, with a Ryzen 5600X and 4x Kingston KSM26ED8/16HD 16GB ram sticks.

        Note that the ram sticks in my system are running fine at 3200MT/s, and dmidecode reports:

        Error Correction Type: Multi-bit ECC

        and

        Total Width: 72 bits

        Data Width: 64 bits
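The width figures quoted above are what give ECC away: 72 bits of total width against 64 bits of data leaves 8 extra bits per word for check bits. A minimal sketch of reading those fields, assuming the text layout quoted in the comment (as printed by something like `dmidecode --type memory`); the function names are made up for illustration and the sample text is just the three lines from the comment, not a full dump:

```python
import re

# Sample lines copied from the comment above; a real dump contains many more fields.
SAMPLE = """\
Error Correction Type: Multi-bit ECC
Total Width: 72 bits
Data Width: 64 bits
"""


def parse_fields(text):
    """Split 'Key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def width_bits(value):
    """Return the integer from a string like '72 bits', or None if it does not match."""
    match = re.match(r"(\d+)\s+bits", value)
    return int(match.group(1)) if match else None


def ecc_summary(text):
    """Report the stated correction type and how many bits exceed the data width."""
    fields = parse_fields(text)
    total = width_bits(fields.get("Total Width", ""))
    data = width_bits(fields.get("Data Width", ""))
    extra = total - data if total is not None and data is not None else None
    return {
        "correction_type": fields.get("Error Correction Type"),
        "extra_bits": extra,
        "looks_like_ecc": bool(extra),  # True only when total width exceeds data width
    }


if __name__ == "__main__":
    print(ecc_summary(SAMPLE))
```

A non-ECC module would typically show equal total and data widths, in which case `looks_like_ecc` comes out false.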

    3. phuzz Silver badge
      Stop

      Worth noting that all recent AMD CPUs support ECC except Ryzen CPUs with built in graphics (eg Ryzen 5600G). Which is slightly annoying, as otherwise they'd be a good choice for a home NAS.

      (As I understand it, all the new Zen4 CPUs will have on board graphics, but will still support full ECC. DDR5 does also come in a full ECC version, as well as the normal stuff having 'sort-of' ECC).

      So, anyone want to buy 16GB of DDR4 ECC? Never used (except for a memtest run).

      1. dajames

        Worth noting that all recent AMD CPUs support ECC except Ryzen CPUs with built in graphics (eg Ryzen 5600G).

        My Ryzen 7 Pro 4750G supports ECC, that's why I chose it. Perhaps it's the "Pro" bit that makes it different, or maybe it's not quite as recent as you meant?

        The motherboard is an ASUSTeK PRIME A520M-A, by the way.

        1. phuzz Silver badge

          Ah yes, the Pro line does support it, I was talking about their consumer chips but I should have made that clear.

  3. Roger Kynaston
    Meh

    I worry about the longevity of Emperor Penguins

    This seems to show that there is a bit of a single point of failure by having one (albeit very dedicated and thorough) person to have the ultimate say over something so important.

    I hope that the kernel developers and Linus work to get a more resilient system without sacrificing the care with which the development is completed these days.

    1. Fruit and Nutcase Silver badge
      Alert

      The Emperor has no

      clothes[DR hardware].

      Shouldn't he have a backup laptop?

      1. jglathe

        Re: The Emperor has no

        or five?

    2. 42656e4d203239 Silver badge
      Trollface

      Re: I worry about the longevity of Emperor Penguins

      >>This seems to show that there is a bit of a single point of failure by having one... person to have the ultimate say over something so important

      I called it a while ago.... The future is Poettering..... Us penguinistas better learn to love Microsoft and their star employee.

      Once Linus is out of the picture, for whatever reason, watch, in raptures of delight, whilst Windows & Linux are morphed into one jack of all trades master of none OS. Systemd is just the start....

      /S <------- just in case anyone is under any illusion

      1. martyvis

        Re: I worry about the longevity of Emperor Penguins

        Lennart certainly has a lot of the same personality attributes that Linus has - which people either love or hate. When he was in Sydney in 2007 for linux.conf.au my son and I sat down with him for a lunchtime meat pie, if I recall correctly.

    3. Gene Cash Silver badge

      Re: I worry about the longevity of Emperor Penguins

      The buck eventually has to stop with SOMEONE. And hopefully someone competent in the matter.

      Decision-by-committee is the worst death by a thousand cuts.

      However, you do know you're teh awsum when your laptop develops a problem and it's worldwide news.

      1. Scotthva5

        Re: I worry about the longevity of Emperor Penguins

        >>Decision-by-committee is the worst death by a thousand cuts

        Sounds like you have been subjected to one too many "product planning" or "customer focus" meetings, as have I. Herding Jell-O cats is easier than getting a firm decision by a group of people with conflicting interests.

    4. jake Silver badge

      Re: I worry about the longevity of Emperor Penguins

      "This seems to show that there is a bit of a single point of failure by having one (albeit very dedicated and thorough) person to have the ultimate say over something so important."

      It's been addressed. Over two decades ago, on segfault.org (February 23rd, 2000) ... and more recently (and seriously) every couple of years ever since. Seems that kids today don't learn history, and wouldn't believe it if they did, so it gets re-hashed ad nauseam.

      Use the search engine of your choice and look up "What happens if Linus gets hit by a bus?".

      1. 42656e4d203239 Silver badge
        Happy

        Re: I worry about the longevity of Emperor Penguins

        >>"What happens if Linus gets hit by a bus?"

        Your post is, at the moment, the second hit for that search term on Google....

  4. Old Used Programmer

    Sometimes...it's required

    I built a dual Opteron system in 2002. Those processors *required* the use of ECC memory. The ECC DRAMs cost about 1/4 of the total build.

  5. sreynolds

    Excuses.....

    What, he doesn't have a standby machine that he can use? Nothing in the cloud that he can rent? Maybe we should get the tin cans out for the poor guy.

    1. Claverhouse

      Re: Excuses.....

      Creating a universal kernel for everyone in the cloud is possibly not ideal from the point of protective security.

      1. doublelayer Silver badge

        Re: Excuses.....

        The code's open. People will be compiling it all over the place in a matter of days. Would compiling and testing on a compromised server really do anything? I think not, but even if it would, just using a cloud machine doesn't mean there's anything wrong with it. The classic reason to distrust the cloud is the provider having access to the private data, but as this is open source code, there isn't private data to be had.

        1. sreynolds

          Re: Excuses.....

          Is Linus still taking patches off the LKML (Linux Kernel Mailing List), or have they regressed to sharing USB drives, which have just recently replaced floppies?

  6. Electronics'R'Us
    Holmes

    It is easy

    To implement an ECC memory system provided the processor itself supports it; I have done plenty of them.

    What needs to be done at motherboard level is to expose the ECC signals (for classic ECC) from the processor to the memory devices. ECC is just an extra memory device as the actual ECC (again, classic ECC) is done in the memory controller. Correct one, detect two is the most common form. The extra memory device stores the ECC syndrome bits.
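"Correct one, detect two" (SECDED) can be sketched with a textbook extended Hamming code. This is an illustration only: real memory controllers often use other SECDED constructions, and the function names and bit layout here are assumptions, not anything taken from a specific controller. It does show why 64 data bits need 8 extra bits, giving the familiar 72-bit ECC word:

```python
def check_bits_needed(data_bits):
    """Hamming check bits for data_bits, plus one overall parity bit for double-error detection."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def encode(data, data_bits):
    """Return a list of bits: index 0 is overall parity, indices 1..n are Hamming positions."""
    r = check_bits_needed(data_bits) - 1
    n = data_bits + r
    code = [0] * (n + 1)
    j = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two, so a data position
            code[pos] = (data >> j) & 1
            j += 1
    for i in range(r):               # fill each power-of-two position with its group parity
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) & 1      # overall even parity across the whole word
    return code


def decode(code):
    """Return (data, status) where status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is zero for a valid word
    overall_odd = sum(code) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        code[syndrome] ^= 1          # single-bit error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable" # even overall parity with a non-zero syndrome: two flips
    data, j = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << j
            j += 1
    return data, status


if __name__ == "__main__":
    word = encode(0xDEADBEEFCAFEF00D, 64)
    print(len(word), "bits stored for 64 bits of data")
    word[17] ^= 1                    # simulate a single flipped cell
    print(decode(word))
```

Flipping a second bit in the same word makes `decode` report `uncorrectable` rather than silently returning wrong data, which is the whole point of the extra parity bit.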

    It is only a few connections (10 or 11 depending on details and version of the interface) and does not add significant cost to the design but Intel et al. want people to think it is somehow magical and expensive - it isn't.

    There is a type of embedded ECC (I have seen it in SRAMs) where the checks are done within the device. The testing for this is often done at LANSCE as the majority of problems are due to free neutrons and there is a neutron beam facility to see just how robust the solution is.

    Devices that start to fail within a short time (which is a few years) are rarely due to age, but more likely the device has heated up significantly (to the point that the refresh rate has to be increased - provided that has been enabled which it probably is not).

    Excessive heating of semiconductor devices shortens the life of them quite significantly - the rule of thumb is the failure rate can double for each temperature rise of 10°C (see the Arrhenius equation).
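The doubling rule can be put next to the Arrhenius acceleration factor it approximates. A rough sketch: the 0.7 eV activation energy is an assumed, illustrative value (real figures depend on the failure mechanism), and the function names are hypothetical:

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def rule_of_thumb_factor(delta_c):
    """Failure-rate multiplier if the rate doubles for every 10 degrees C of temperature rise."""
    return 2.0 ** (delta_c / 10.0)


def arrhenius_factor(t_cool_c, t_hot_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between two temperatures given in degrees C.

    activation_energy_ev is an assumption chosen for illustration only.
    """
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool - 1.0 / t_hot)
    return math.exp(exponent)


if __name__ == "__main__":
    for hot in (65, 75, 85):
        print(hot, rule_of_thumb_factor(hot - 55), round(arrhenius_factor(55, hot), 2))
```

With the assumed activation energy, a 10°C rise from 55°C comes out close to a factor of two, which is roughly where the rule of thumb comes from; other activation energies or temperature ranges will drift away from it.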

    L1 parity and L2/L3/main memory ECC is required in flight safety critical avionics which is why Intel parts are not used in those applications.

  7. gnasher729 Silver badge

    So in this case an unremovable alert “memory chip broken” would have been all he needed.

    Haven’t had this problem in 30 years. Before that, as a software developer: App crashed if the largest source file was compiled on one particular machine in the afternoon. After that: 4 out of 7 machines with third party RAM crashed quite reliably if you went for a coffee and stayed away for more than 15 minutes.

    1. sreynolds

      Mate, you see the shit that goes on in space. That ionizing radiation not only causes errors in the charge stored in capacitors like DRAM but also errors in the ALU.

  8. Anonymous Coward

    It really wasn't that long ago when ECC could be had on any old consumer board. I had a P3 system which worked happily with it, and I'm sure I had various K7's that would take them too.

    Workstation pricing and organisations in money-no-object territory have created the rift that makes ECC "rare" for no really good reason.

  9. TeeCee Gold badge
    WTF?

    Hang on..

    "I'll probably leave memtest86+ for another overnight with the new DIMMs..."

    If Linus feels the need to do that, what exactly is the point of ECC memory?

    1. John Robson Silver badge

      Re: Hang on..

      Because you can still get bad chips, and failing during a test is less disruptive than failing in production.

      ECC isn't magic, it can only correct a small count of failures, and reliably detect a few more...

    2. Victor Ludorum

      Re: Hang on..

      There is a possibility it's not the DIMM, but a component on the motherboard that has failed. Soak testing the new memory will help to eliminate that possibility.

    3. Robert Grant

      Re: Hang on..

      His post also mentions that his main PC was set up for error correction code memory (ECC memory), but "during the early days of COVID when there wasn't any ECC memory available at any sane prices. And then I never got around to fixing it, until I had to detect errors the hard way."

      From the article.

    4. Not Yb Bronze badge

      Re: Hang on..

      This is always a good idea for system RAM, and new computer builds in general. Helps get past the "early failure" and "overly heat sensitive" chips faster so you know you're more likely to have reliable memory.

      ECC doesn't save you from bad RAM, it helps detect it, and occasionally can fix for long enough to complete tasks before having to replace the RAM.

  10. Fazal Majid

    The newest (12th gen) Alder Lake non-Xeon CPUs do support ECC

    But only if you use an Intel W680 workstation (i.e. expensive) chipset, e.g. in the HP Z2 Mini G9.

    Making ECC a Xeon-only feature was a classic case of market segmentation by a monopolist to allow them to extract maximum profits from enterprise customers willing to pay more for reliability.

    1. SVD_NL Silver badge

      Re: The newest (12th gen) Alder Lake non-Xeon CPUs do support ECC

      The problem is that this doesn't solve the segmentation they created at all.

      All W680 motherboards I've found so far have a request a quote button instead of a price. This tells me "not for consumers" and "if you have to ask, it's probably too expensive".

      The listings I can find peg the currently available ones at about $450. These boards definitely do not have feature parity with available consumer boards in this price range. This tells me Intel is still asking a crazy price for a chipset that is fundamentally the same as the consumer counterpart. The stat sheets list only vPro and ECC support as differences.

      Consumers will be looking for Z690 boards, so manufacturers are unlikely to make W680 boards as it will directly influence their product discoverability.

      Unless they just enable ECC support for consumer chipsets, it doesn't matter that they pat themselves on the back about this, because it's just inconsequential. There's still market segmentation, and in the end very few consumers will be able to use this feature.

      I understand the price increase for Intel vPro, but for ECC it's just ridiculous.

  11. captain veg Silver badge

    to be fair

    I'm sure that Linus' rant against Intel is justified, but at least he has the option of replacing the DIMMs. With the Apple Silicon MacBook, not so much.

    -A.

  12. vtcodger Silver badge

    Take This With a Grain of Salt

    I'm operating mostly from my elderly and not so great human memory here. But I think the background on ECC is rather complex. The original IBM PC (1981) memory had 9 bit bytes -- 8 data bits and a parity bit. So did its clones and all the subsequent PCs that became common in the late 1980s. And they really needed parity because memory wasn't all that reliable back then. And early PCs didn't use a lot of it. 640K -- all Intel could easily access -- was a lot in the early 1980s. After a while people found ways to use more. But not a lot more. Which was good because memory was expensive. I'm pretty sure that I recall commodity 1MB "DIMMs" (I think we called their predecessors something else back then) going for $100 a MB back in the mid 1990s.
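The ninth bit works like this: one parity bit per byte catches any single flipped bit but cannot say which one, and misses a double flip entirely. A small sketch, assuming even parity purely for illustration (the helper names are made up):

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 when the byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte):
    """Pack 8 data bits plus the parity bit into a 9-bit word (parity in bit 8)."""
    return (parity_bit(byte) << 8) | (byte & 0xFF)


def check(word):
    """True when the 9-bit word still has an even number of set bits."""
    return bin(word & 0x1FF).count("1") % 2 == 0


if __name__ == "__main__":
    word = store(0b10110010)
    print(check(word))                # intact word passes
    print(check(word ^ 0b000000100))  # one flipped bit is detected
    print(check(word ^ 0b000000110))  # two flipped bits slip through
```

Parity can only raise an alarm; correcting the error needs more check bits, which is where Hamming-style ECC comes in.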

    In the 1990s it became obvious to "The Industry" that GUI was the future and that GUI OSes were going to need LOTS of memory. So they looked for corners to cut in order to keep PCs affordable. One of those corners was memory parity. Get rid of that ninth bit and we can get more data bits on a chip. And make a bit more profit. And maybe even make the products a bit cheaper for the consumer. And for those few who REALLY care about memory integrity we'll do something more sophisticated than parity. Hamming Codes. If you're curious, here's a link to an article I originally wrote for the Compuserve Hardware Forum about three decades ago. http://donaldkenney.x10.mx/GLOSSARY/ECCMEMOR.HTM. Back in the 1990s, many gave credit/blame for the disappearance of parity in consumer PC memory to Microsoft. Maybe they were right.

    PC hardware and software in the 1990s often didn't work all that well even on good days. I personally think lousy non-parity memory was part of the problem. But there were many other issues. So we don't have parity in consumer PCs nowadays. And AFAICS, no one knows (or much cares) if that's a significant problem.

    1. ThomH

      Re: Take This With a Grain of Salt

      > 640K -- all Intel could easily access -- was a lot in the early 1980s.

      *cough* 'Intel' (i.e. 16-bit x86) can easily access 1MB of address space.

      The decision to make only 640KB of that general-purpose RAM was IBM's; the rest is reserved for video card ROM and RAM, the BIOS ROM, etc.
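The 1MB figure falls out of how real-mode x86 forms addresses: a 16-bit segment is shifted left four bits and added to a 16-bit offset, giving a 20-bit physical address. A minimal sketch (the function and constant names are illustrative; the wrap at 20 bits reflects the original 8086, and later CPUs with the A20 line enabled behave differently):

```python
ADDRESS_SPACE = 1 << 20          # 1 MiB reachable with 20 address lines
CONVENTIONAL_LIMIT = 0xA0000     # start of the area the IBM PC reserved for video, ROM, etc.


def physical_address(segment, offset):
    """Real-mode segment:offset to physical address, wrapping at 20 bits as on an 8086."""
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)


if __name__ == "__main__":
    print(hex(physical_address(0xA000, 0x0000)))   # first byte above conventional memory
    print(CONVENTIONAL_LIMIT // 1024, "KiB of general-purpose RAM below it")
    print(hex(physical_address(0xFFFF, 0x000F)))   # last byte of the 1 MiB space
```

So the 640K limit was a line IBM drew at segment 0xA000 inside a 1 MiB space, not a limit of the processor.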

      1. jake Silver badge

        Re: Take This With a Grain of Salt

        And even then, you could fairly easily bump that 640 to 704 by relocating VGA video ROM higher up, or removing it entirely if all you needed was a text console (which was almost everybody, in the early days).

        1. Nick Ryan

          Re: Take This With a Grain of Salt

          If it was that early then it would likely have been EGA or even CGA video. There was something before that but it's lost to my memory, and probably just as well. Moving the ROM or RAM really depended on whether or not the card manufacturer supported it and if you had the patience to play the jumper game (I probably still have oodles of spare jumpers) and manage to get whatever cards were in a system somehow cooperating on IRQs and memory space.

    2. DoContra

      Re: Take This With a Grain of Salt

      From my understanding, ECC memory really becomes important/proper useful in consumer situations when the PC stays turned on or on S<=3 standby for long periods of time, and with large amounts/density of RAM (with a crossover somewhere between 16 and 32GB of RAM, esp. with sticks of >= 16GB); switching off/rebooting a PC on the regular shields you from the brunt of random bit-flips (as long as your sticks are not having issues). Admittedly, this seems to be the case currently (from my experience being the IT guy at my work, most people will let their computers suspend to RAM, or just leave them on continuously ¬¬).

      Fully agree with you that dodgy RAM could've been a big part of software instabilities in the 90's (or at least, much bigger than given credit to), but I see it more as a QC/field failure of sticks rather than "this wouldn't have happened with ECC/parity".

      PS: I will admit I used to S3 suspend my machine for a couple of years, but nvidia drivers on GNU/Linux, plus switching to an SSD, made me stop.

      1. Nate Amsden

        Re: Take This With a Grain of Salt

        I was running servers 20 years ago with linux with 3 to 4G of ram and all of it ECC.

        Unsure why the logic of running Linux on 4G back then would be different now. Regular ECC is barely adequate anymore. HP's Advanced ECC came out in the 90s and is far superior. IBM came up with Chipkill at a similar time (never used IBM servers myself). Dell never came up with anything but presented an "Advanced ECC" option in their Xeon 5500 systems I remember. It was a different technology vs HP. Dell's advanced ECC came from Intel and while it did the job it removed a third of the memory capacity of the system if I recall right. HP's has no overhead by default (but can have overhead if you enable even greater protection from DIMM failures including online spare memory and memory mirroring).

        That said I just got my first laptop with ECC last week. Lenovo P15 with 8 core xeon and 80G of ram. Was kicking myself for not going ECC on my 2016 Lenovo P50 which I still use today, currently with 48G. Not that I've noticed any instability. It runs 24/7. Both laptops run linux mint 20.

        Per Linus, I wonder why he didn't just remove the faulty DIMMs. HP and Dell, perhaps others too, can tell you which DIMM is bad. I remember the pain involved when I used Supermicro years ago which could not do that; tracking down which was bad was annoying. I'd assume he has at least a half dozen DIMMs and can lose 2 or 3 (maybe have to remove a full bank) and still have a more functional system than a laptop.

        1. Anonymous Coward

          Re: Take This With a Grain of Salt

          Actually, HP Advanced ECC didn't come out in the '90s but arrived with the introduction of the ProLiant Gen7 with DDR3 memory and Xeon 5500 processors. It has been mostly used as a way to lock customers into buying HP/HPE memory instead of 3rd party (the Adv ECC memory modules are not much different from regular modules aside from the HPE ID; the real difference in Advanced ECC is in the memory management on the mainboard).

          IBM's Chipkill isn't much older.

          Dell servers since the PowerEdge 11G don't rely on proprietary ECC modules, instead they simply hot-spare 1/3rd of the installed modules (which are plain standard ECC registered memory DIMMs) which jump in when a memory module fails during operation. Some PowerEdge servers also offer memory mirroring (i.e., RAID1 for memory). This is also why there is less usable memory.

          1. Nate Amsden

            Re: Take This With a Grain of Salt

            I'm quite sure I had Advanced ECC on Proliant G3.

            I quoted 90s because this HP doc says as much

            http://service1.pcconnection.com/PDF/AdvMemoryProtection.pdf

            "To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in 1996."

    3. Anonymous Coward

      Re: ancient memories

      In prehistoric days memory came as DIL chips to plug into sockets on memory boards or cards; the next generation you are thinking of was the SIMM (single in-line memory module) which had contacts on only one side. The first sort were 30-pin, and had either 8 (no parity) or 9 (with parity) bits. IIRC many PCs wanted 9-bit but Macs could use 8-bit, so you could swap one way but not the other. These were usually seen in 256k, 512k or 1M (byte) varieties. Next up were 72-pin SIMMS, with 32 (no parity) or 36 (parity) bits. The parity ones were used in fancy Unix workstations but by this time PCs could slum it with non-parity versions and would mostly just ignore the extra bits. These got up to about 64M per SIMM, but 64M was the maximum memory for many PC motherboards and seemed very lavish then. DIMMS (dual in line memory modules) took the revolutionary step of putting contacts on both sides to accommodate memory capacities going up faster even than UK house prices.

      I lost track of ECC when DIMMS came along; I only know that I've got a little bag of ECC DDR DIMMs (512M and 1G) which never worked in any PC motherboard I tried. I suspect their previous owner had the same experience so was not unhappy with the £1 he took for them!

  13. kurkosdr

    Macbook?

    What's this weird obsession Linuxeros have with Macbooks? I understand why they'd want to have an Intel Macbook (Linux-friendly hardware with open-source drivers), but the new Apple Silicon Macbooks are actually hostile to Linux and only support MacOS. Is it some kind of status symbol I don't get?

    1. ThomH

      Re: Macbook?

      but the new Apple Silicon Macbooks are actually hostile to Linux and only support MacOS.

      Meanwhile, back in reality, the Asahi Linux FAQ states:

      Apple allows booting unsigned/custom kernels on Apple Silicon Macs without a jailbreak! This isn’t a hack or an omission, but an actual feature that Apple built into these devices. That means that, unlike iOS devices, Apple does not intend to lock down what OS you can use on Macs (though they probably won’t help with the development).

      So I guess we're defining "actually hostile" to mean "won't help with the development", and "only support MacOS" to mean "Apple allows booting unsigned/custom kernels"?

      Elsewhere from the FAQ, to get ahead of the topic:

      Apple still controls the boot process and, for example, the firmware that runs on the Secure Enclave Processor. However, no modern device is “fully open” - no usable computer exists today with completely open software and hardware ...mainstream x86 platforms are arguably more intrusive because the proprietary UEFI firmware is allowed to steal the main CPU from the OS at any time via SMM interrupts, which is not the case on Apple Silicon Macs. This has real performance/stability implications; it’s not just a philosophical issue.

    2. Yet Another Anonymous coward Silver badge

      Re: Macbook?

      Have any alternative high performance consumer ARM system to test on?

      Or you could only do builds on Intel and hope the code is portable.

      IIRC the emperor penguin used to do builds on a PowerPC for similar reasons

  14. Anal Leakage

    No shit, Sherlock

    Leave it to a Reg hack to draw performance comparisons between a Threadripper and an M2 MacBook Air. You know which platform would make for even worse compiling performance? A Psion.

  15. janusng
    FAIL

    Apple Silicon has RAM built-in. It cannot be an ARM64 MacBook.

    There is no RAM slot on an M1 or M2 MacBook. All the RAM comes inside the SoC. There is no way Linus is ordering RAM for it.

  16. -tim
    Facepalm

    How about an error message?

    Perhaps it is time that the boot process produce a warning for systems that don't support ECC.

    I've noticed that old systems that properly report ECC errors tend to do so around the time of unusual solar activity.

  17. shawn.eary

    At home, I run dual socket MB with ECC FB DIMMs. My home machine is probably 10 years old now and I'm seeing serious issues [1]. My most recent DIMMs were bought used off eBay and I think one of them is going bad. Logically everyone should be using memory that can correct errors, but for me, I seriously wonder if things like ECC, Fully Buffered, Dual Socket and power-loss SSDs are worth the extra cost. I've seen plenty of people with SSDs that don't have power loss safety and people that don't have ECC memory that seem to have more reliable, cheaper and faster computers than I have. I'm still on mechanical HDs at home BTW.

    One last thing: I know this thread is about GNU/Linux, but I wish Microsoft would get rid of that stupid TPM requirement for Windows 11. I'm pretty sure TPMs don't increase security as much as most users believe they do. I'm pretty sure many Electrical Engineers can figure out easy ways to hack TPMs.

    [1] - Specwise, this home machine is massively multi-threaded and should perform really well, but what I seem to be seeing is that many computer programmers still don't properly take advantage of parallelism as much as they could. For this reason, it is often better to maybe get a cheaper computer with fewer cores and really fast single threaded performance (i.e. high but stable clock rate with high MIPS).
