Intel to deliver fix for Raptor Lake CPUs made 'unstable' by voltage snafu

Intel has promised to deliver a fix for some of its recent desktop processors suffering "stability issues." In June, Chipzilla finally admitted it had confirmed reports that some of its 13th- and 14th-gen silicon – aka Raptor Lake – is wonky on account of "elevated voltage input to the processor due to previous BIOS settings …

  1. Geoff Campbell Silver badge
    Pirate

    Too much complexity

    If I may channel the spirit of the Reg's FOSS desk correspondent for a moment, this is a sure sign that CPU design has got massively too complex, as a result of poor software design favouring brute force performance over parallelism.

    Back in the 1980s, the Transputer demonstrated what could be achieved by a very simple processor design in a big array, coupled to good software designed to take advantage of the parallelism this allowed. Then we ignored that lesson, and went for brute force instead.

    I'm not hopeful that this is being fixed, as I see few examples of mainstream software really, truly utilising parallelism. Oh, well...

    GJC

    1. Spazturtle Silver badge

      Re: Too much complexity

      "the Transputer demonstrated what could be achieved by a very simple processor design in a big array, coupled to good software designed to take advantage of the parallelism this allowed."

      That is essentially what the compute based GPU design that the industry switched to in the 2010s is. The problem is that some things just don't scale well in parallel.

      1. that one in the corner Silver badge

        Re: Too much complexity

        > The problem is that some things just don't scale well in parallel.

        As we have seen machines moving to more and more cores in the CPU (with or without hyperthreading), it is only too obvious how few programs can even take advantage of using "more of what we are used to", let alone reworking to take advantage of GPUs. Bring up Task Manager - and fire up lots of separate apps just to make it look as though that CPU was a sensible buy.

        (Then again, given what I've seen coders do with mutexes it is probably a good thing that we don't have everybody trying to parallelize...)

        1. Steve Graham

          Re: Too much complexity

          I compiled the latest Linux kernel this morning, and noticed that all 16 "CPUs" (real cores and hyperthreaded, um, threads) were running pretty consistently at full speed.

          1. John Robson Silver badge

            Re: Too much complexity

            No doubt, but that's not a typical Windows computer workload.

            Whilst server farms are no doubt optimised to reduce the number of idle cores, the majority of the world's desktops spend the vast majority of their time idle.

      2. Anonymous Coward
        Anonymous Coward

        Re: Too much complexity

        One problem I deal with that is hugely compute intensive needs a series of matrix inversions along the length of an object, with the output of one "slice" passed as input to the next.

        Each matrix operation can be parallelised, at some cost in developer time. That isn't so bad to do.

        It is obviously impossible to do the length element directly in parallel. Something predictive could perhaps be used to estimate the range of possible inputs to the next slice before the results are available, then throw away the unused data once the actuals are known - provided the error of the prediction against the actual doesn't end up costing more than just waiting for the actual, and the thermals aren't disadvantaged by having to throw away a lot of guesstimate calcs.
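
        To make the shape of that concrete, here is a minimal sketch - the "physics" of how one slice feeds the next is invented purely for illustration, only the dependency pattern matters:

        # Slice-by-slice structure: parallel inside a slice, sequential along the length.
        import numpy as np

        def process_slice(coupling, state):
            # The per-slice solve CAN be parallelised internally -
            # BLAS/LAPACK will happily spread this across cores.
            return np.linalg.solve(coupling, state)

        def run_length(couplings, initial_state):
            # The loop over slices CANNOT be parallelised: each iteration
            # needs the previous slice's output as its input.
            state = initial_state
            for coupling in couplings:
                state = process_slice(coupling, state)
            return state

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            n, n_slices = 200, 50
            couplings = [np.eye(n) + 0.01 * rng.standard_normal((n, n))
                         for _ in range(n_slices)]
            print(run_length(couplings, rng.standard_normal(n))[:5])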

        Pre-calculated lookups for possible slice results might be viable if the number of permutations weren't ludicrous. That is how the pen-and-paper approaches originally developed in the 1930s had to operate (with the lack of detail requiring caution in the assumptions).

        Parallelism has its place, absolutely, but the development tools to do it well require considerable effort and additional testing.

        If your list of users is measured in the range of maybe 100 people, globally, that does tend to set a budget on how much development you can actually do.

        I shouldn't complain too much though. Such problems are half interesting to think about. As opposed to cleaning up Excel messes...

    2. Lee D Silver badge

      Re: Too much complexity

      There's a reason that microkernels and parallel computing aren't ruling the world.

      It doesn't always work as well in practice as in theory, because most things just aren't parallelisable, and the overhead involved in managing the logistics of both loses all the (theoretical) performance gains that you can get.

      In the case of the former, it's supposed to provide better isolation and security, not performance. In the case of the average home machine your GPU is basically an isolated machine for a reason - it's inherently insecure in itself and the parallel code it executes can play with all kinds of stuff on the chip that it's not supposed to. The performance vs security tradeoff means that they made it very fast, very insecure, and then they only secure the stuff you can send to the card.

      It's been that way with OSs and CPUs for far longer - Windows was performant and simple, and sacrificing security for that caused almost all the problems that older and even modern Windows suffers from. Even basic memory security requires so many more checks, balances, isolation, middle-men handling data, etc. that it slows things down. Remember Spectre/Meltdown? Basically caused by a security tradeoff for performance in speculative execution (literally a performance-first feature).

      And all the "first technologies" are insecure for years buy that helps them work faster and get to market first, and that almost always wins. 3DFX cards allowed arbitrary DMA to any part of memory for decades before anyone noticed that's how they worked.

      Despite everyone's wrangling, few people actually care about security - they just want stuff to work and work fast and will pay for a faster chip over a more secure one. People's primary gripe about the Spectre/Meltdown fixes was "Oh, but it slows my computer down".

      Parallelism, though, is a very different beast - we can go to 128-core chips and we all have GPUs with highly-parallel workloads on them. What do we use them for? Games. Or things like farming off the antivirus and other stuff that we don't want to "get in our way" for performance. Do we actually do anything that takes significant advantage of 128 cores and their parallelism? No, we just run 128 single threads at the clock speed we already had. No one program sees the benefit of such parallelism; we just run more programs (that are each unsuited to parallelism) simultaneously, and are still bound by the clock speed. And if you have 128 cores at a slower clock rate than a 64-core chip, people aren't going to queue up to buy the 128-core chip.

      Even our most popular programming languages struggle to express any significant use of parallelism. Architectures simply aren't built for it. And hardware is designed for speed first.

      It's not any one component... it's that nobody is prepared to sacrifice raw single-thread performance for anything else, even security in many cases. It's a market problem, a software problem, a hardware problem, a sales problem, a programming language expression problem, and ultimately nothing is set up to go that way, and nothing that goes that way has really taken off the way you would expect in all the decades we've had it. Most cloud computing is just bog-standard PC architectures lumped into highly-available groups, with workloads distributed by third-party software - no real parallelism, even though it forms a perfect use case for such. Instead we deploy individual VMs on specific hosts, and software load balancers, rather than farming the work out across dozens of connected PCs able to parallelise the work required across them all.

      Because, when you get down to it, parallel computing is hard to achieve in hardware, much more difficult to program, understand and debug, slower than other methods, and requires far-greater interconnects, memory speeds, etc. to get close to matching the performance of just lobbing it at a dedicated processor, or even chopping it up logistically and lobbing it at ten processors via software.

      1. gryff
        Boffin

        Re: Too much complexity

        Hmmm...all of this reminds me of a small probably (best) forgotten project at ICL in the 1990s called ...Goldrush Megaserver..?

        The h/w side was massively parallel "first of a kind."

        One reason it didn't go far was the need to re-write your code, coupled with an absence of deep, advanced tools to facilitate such developments.

        QUOTE:

        "Because, when you get down to it, parallel computing is hard to achieve in hardware, much more difficult to program, understand and debug, slower than other methods, and requires far-greater interconnects, memory speeds, etc. to get close to matching the performance of just lobbing it at a dedicated processor, or even chopping it up logistically and lobbing it at ten processors via software."

    3. Richard Tobin

      Re: Too much complexity

      The transputer did indeed demonstrate what could be achieved by a very simple processor in a big array, and the answer was "not much". For several years people enthused over it, but failed to produce useful solutions with it. It became clear that the vast majority of tasks just weren't amenable to being solved that way: they have parts that are inherently sequential, and even when parallelized need access to shared memory.

    4. Anonymous Coward
      Boffin

      Re: Too much complexity

      > .. Back in the 1980s, the Transputer demonstrated what could be achieved ..

      The reason the Transputer didn't happen was because Intel didn't own it. Commercial enterprises invariably settle out into a duopoly. In software and chip design, it seems to have happened a lot faster. Who is left of the companies that make x86 chips?

      1. DS999 Silver badge

        Re: Too much complexity

        Intel owning it wouldn't have helped it.

        Just look at Intel's history when it comes to doing anything not x86:

        i432: failure

        i860: failure

        Itanium: failure

        Transputer failed because that type of architecture is only applicable to a narrow range of problems. Exactly the sort of problems people solve with a GPGPU today - but even that is being greatly aided by decades of compiler and language advances since Transputer.

        If Intel had developed or purchased Transputer it would have been added to the list of Intel failures above.

    5. Mage Silver badge
      Facepalm

      Re: Too much complexity

      Yes, I've had microcode upgrades within Linux (reboot when you feel like it). On XP I had to make a floppy and boot with it.

      I think Thatcher killed Inmos. I remember Transputer demos (was at a live one). IP etc sold to Thomson who only used the CPU design for standalone parts.

      It was the era when the 386 was new.

  2. Anonymous Coward
    Anonymous Coward

    Microsoft are to blame

    For releasing an OS that requires some third-party component this complicated to work properly in the first place.

    What? Ah. Sorry. This isn't about machines crashing due to CrowdStrike, is it? There were so many articles about that I just got into a rut.

    Ahem. Obviously, the correct response here should have been:

    "Ha ha, I run AMD, nyaah".

    1. ecofeco Silver badge
      Facepalm

      Re: Microsoft are to blame

      There's a reason El Reg runs a lot of Intel and M$ stories.

      Because Intel and M$ suck. And have for decades.

      Do try and keep up.

  3. nematoad Silver badge
    Happy

    Intel.

    "The chip shop."

    Well, they certainly fried something!

  4. rgjnk
    Alert

    Damage

    The microcode might 'fix' it, but my rusty semiconductor memories about things like electromigration suggest that having run at elevated temperature & voltage for extended amounts of time might have already led processors too far down the road to an early death.

    So even after they're fixed it might still be a bit too late - possibly more stable for a while, but with reduced lifespan and stability going forward?

    Though I guess if they survive long enough to be beyond warranty that's considered 'good enough' as a fix!
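
    For anyone whose semiconductor memories are even rustier than mine, the usual back-of-an-envelope model for electromigration wear-out is Black's equation, MTTF = A * J^-n * exp(Ea / kT). A rough sketch below - the activation energy, exponent and operating points are made-up illustrative numbers, nothing measured from any Raptor Lake part:

    # Illustrative only: Black's equation acceleration factor for
    # electromigration, with invented parameters.
    import math

    K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

    def acceleration_factor(t_nominal_c, t_elevated_c, j_ratio=1.0,
                            ea_ev=0.9, n=2.0):
        """How much faster the interconnect wears out at the hotter,
        higher-current condition relative to the nominal one."""
        t_nom = t_nominal_c + 273.15
        t_hot = t_elevated_c + 273.15
        thermal = math.exp(ea_ev / K_BOLTZMANN_EV * (1 / t_nom - 1 / t_hot))
        current = j_ratio ** n  # higher voltage -> higher current density
        return thermal * current

    # e.g. running at 100 C instead of 80 C with ~10% more current density
    print(f"~{acceleration_factor(80, 100, j_ratio=1.1):.1f}x faster wear-out")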

    1. Irongut Silver badge

      Re: Damage

      There are rumours among OEMs and cloud providers that there is a manufacturing defect in these chips. We'll see how far Intel can fix the problem with a microcode update but if those rumours are correct the problem is not going away. One OEM estimates millions of chips are affected among their customers alone.

      1. Spazturtle Silver badge

        Re: Damage

        Intel has put out another statement admitting that there was a manufacturing defect with the anti-oxidation layer but has said that it is now fixed.

        I wonder how many other defects they know of but have not yet disclosed.

        1. An_Old_Dog Silver badge

          Re: Damage

          I wonder how many other defects they know of but have not yet disclosed

          A cynical person would suggest gauging that on a going-forward basis by tracking large sales and trades of Intel stock by members of the Board of Directors.

          "Problems crest? Divest, divest, divest!"

    2. JRStern

      Re: Damage

      It's not covered here, but at another site the coverage does say there is some "aging" involved: running the wrong power algorithm does permanent damage, and Intel is supporting RMAs.

      What is better covered here, if it is accurate, is that this power algorithm only goes wonky on the limited, unlocked chips.

      There is also mention of an aging/corrosion fab failure that they say was caught and quickly fixed, but it's not clear quite how quickly.

      And there is chatter that this might extend to other chips in the line, even regular locked chips, that use the same designs.

      I don't have enough involvement to judge how much of this is likely or certain.

  5. Bartholomew
    Meh

    lower voltage

    Can I just take one second to say that lower voltage means lower current, and lower current means it takes longer to switch transistors on and off. In other words, Intel's solution is to lower the average clock rate of some of the clocks used. But of course that is not going to generate good PR, so they have used "lower voltage", which is technically correct, to hide the solution.

    Now I'm not saying that there will be a drop in average performance (because in modern CPUs enough parameters can be tweaked to disguise this, e.g. make the peak clock used higher* and the lowest clock used higher), but there will probably be a drop in efficiency.

    *A higher peak clock means that thermal overload will be reached much faster, and then the device will need to be under-clocked for longer to cool down. But application startup times will feel faster (or the same, in the case of this workaround). It will also reduce the useful lifetime of the product, but as long as failures land outside the warranty period, that is not really a concern for Intel.
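
    As a very rough, textbook-level illustration of why voltage and attainable clock are joined at the hip, here is the alpha-power delay model with completely made-up parameters (nothing here is measured from a real Raptor Lake part):

    # Alpha-power law sketch: gate delay ~ V / (V - Vth)**alpha,
    # so the maximum stable clock scales as (V - Vth)**alpha / V.
    # Vth and alpha are invented for illustration.
    def relative_fmax(v, v_th=0.35, alpha=1.3):
        return (v - v_th) ** alpha / v

    elevated = relative_fmax(1.40)  # the over-generous voltage request
    reduced = relative_fmax(1.30)   # a more sensible voltage request
    print(f"clock headroom falls to ~{100 * reduced / elevated:.0f}% "
          f"of what the elevated voltage allowed")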

    1. Snake Silver badge

      Re: lower voltage

      "Can I just take one second, to say lower voltage means lower current, and lower current means it takes longer to activate/deactivate transistors. In other words Intel's solution is to lower their average clock rate in some of the clocks used. But of course that is not going to generate good PR, so they have used lower voltage, which is technically correct, to hide the solution."

      That's not the way I'm reading it. I'm reading it as: the microcode isn't power-throttling the CPU properly when the thermal limits are reached.

      "In June, Chipzilla finally admitted it had confirmed reports that some of its 13th- and 14th-gen silicon – aka Raptor Lake – is wonky on account of "elevated voltage input to the processor due to previous BIOS settings which allow the processor to operate at turbo frequencies and voltages even while the processor is at a high temperature."

      "Based on extensive analysis of Intel Core 13th/14th Gen desktop processors returned to us due to instability issues, we have determined that elevated operating voltage is causing instability issues," communications manager Thomas Hannaford wrote in a Monday post.

      "Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor," he added.

      All modern processors apply a power limit when thermal limits are reached, but it sounds like they mis-programmed the regulation system and the CPU doesn't apply enough limiting, allowing it to run thermally beyond its limits. A positive feedback loop, if you will.

      "Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages," Hannaford's post states, with the silicon-slinger currently testing to ensure its revised code makes the problem go away.

      At least I called it 9 days ago, as a voltage problem

      https://forums.theregister.com/forum/containing/4894579

      The motherboard manufacturers allowing limits beyond reason only exacerbated the problem, as the board wouldn't kick in any limit protection of its own if the CPU asked for power beyond its thermal ability to dissipate.
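
      A toy illustration of the kind of thermally aware clamp on voltage requests being described - emphatically not Intel's actual microcode, just the general shape of the idea, with invented thresholds:

      # Toy model of clamping voltage requests at high temperature.
      # Thresholds and values are invented; this is NOT Intel's algorithm.
      def grant_voltage(requested_v, temp_c,
                        v_ceiling=1.40, throttle_temp_c=95.0,
                        throttle_v=1.20):
          """Return the voltage actually granted for a given die temperature."""
          if temp_c >= throttle_temp_c:
              # Hot: refuse turbo-level voltage regardless of the request.
              return min(requested_v, throttle_v)
          # Cool enough: honour the request up to an absolute ceiling.
          return min(requested_v, v_ceiling)

      # The reported failure mode, loosely: turbo voltage still granted when hot.
      print(grant_voltage(1.45, temp_c=70))   # capped at the 1.40 ceiling
      print(grant_voltage(1.45, temp_c=100))  # throttled to 1.20 when hot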

    2. Ace2 Silver badge

      Re: lower voltage

      The CPU is wired up to a chip called a VRM (Voltage Regulator Module). The CPU is continuously communicating to the VRM exactly what V it requires at any given time - it changes constantly as the frequency, utilization, and temperature bounce around. If at any point the system gets a little out of whack, you’re going to see instability.

  6. Anonymous Coward
    Anonymous Coward

    Minty fresh tartar removal

    '"KS" kit is for enthusiasts who probably have opinions about the qualities of different thermal pastes'

    The kind of people who don't use computers but can wank on for hours about them without any actual knowledge or useful information being shared.

    1. Anonymous Coward
      Anonymous Coward

      Re: Minty fresh tartar removal

      Ooh bless, that hit a nerve.

      I prefer my hardware working and stable, so I can forfeit a tenth of a second in my compile times; an added bonus is that I don't need to waste my life worrying about which flavour of unicorn shit is in vogue this month because someone noticed half a degree difference in their flawed 24x7 thermal monitoring trial.

  7. Updraft102

    Linux and Windows both update microcode at the OS level when updates become available. Linux is not alone in that, though it does so more reliably (as Microsoft typically reserves those updates for its most recent OS, leaving the older ones ignored).

  8. Henry Wertz 1 Gold badge

    finicky microcode update?

    Most of my systems are well out of support, but usually the microcode update is done in two ways...

    BIOS update. The BIOS then updates the microcode on each powerup.

    OS update. Windows has gotten microcode updates via Windows Update before (Meltdown etc.).

    Linux can either load microcode early from GRUB (not usually needed; it's for when the microcode is in bad enough shape that the system can't boot, which is obviously uncommon) or a few seconds later in the boot process.
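
    If you want to check which revision actually got loaded by whichever route, something like this works on most x86 Linux boxes (it assumes /proc/cpuinfo exposes a "microcode" field, as it normally does on x86; adjust elsewhere):

    # Report the currently loaded microcode revision(s) on Linux x86.
    def microcode_revisions(path="/proc/cpuinfo"):
        revisions = set()
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("microcode"):
                    revisions.add(line.split(":", 1)[1].strip())
        return revisions

    if __name__ == "__main__":
        print("Loaded microcode revision(s):", microcode_revisions() or "none found")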

  9. Anonymous Coward
    Anonymous Coward

    voltage requests TO the processor?

    From the article: "Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor," he added.

    Are the voltage requests going TO the processor, or coming FROM the processor?

    Surely the latter, not the former.

  10. FIA Silver badge

    Put those pesky crashes in the past with a fun microcode upgrade!

    Not necessarily so.

    If you have an affected CPU the microcode update should stop it breaking, but it won't reverse any damage already caused.

  11. cenpjas
    Flame

    Is Intel stalling?

    Not a fan of conspiracies, but reading up on all the different stories OEMs and data centres have been fed, and the 'blame your motherboard vendor' stuff Intel spread, I think we're not going to see a pure software fix. One of the reasons is the damage already done to all the CPUs, not just the ones that have already failed. A presumption of mine would be a forced voltage limit, which would effectively defeat the point of K-class Intel CPUs.

    To go further on the process errors (at the fab), we may find over time that more CPU models are affected. If it's a fab issue, at least Intel can recover somewhat.

    Lastly, as I understand it, Intel engineers (or at least the project leads) were bragging that the design of Raptor Lake was done faster than the AMD team managed with Zen 4 (if I remember right).

    I suspect shortcuts taken to get the product to market are nibbling at Intel's and its customers' arses.

    1. cenpjas
      Mushroom

      Re: Is Intel stalling? [Yes, and misdirecting]

      I know replying to my own post makes me one of 'them', but I watched a video that dropped just after posting, in which Intel admitted to process issues.

      Long story short, search "Intel's Biggest Failure in Years: Confirmed Oxidation & Excessive Voltage" on YouTube.

      Not that anybody will read this now it's fallen down the news listings :)

      1. Colin Wilson 2

        Re: Is Intel stalling? [Yes, and misdirecting]

        I read it :)
