FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …

  1. Blackjack Silver badge

    Somehow this will all end with Intel having part of the blame...

    1. imanidiot Silver badge

      Since some parts of modern x86 and x64 chip designs result directly or indirectly from decisions Intel has made in the past, it's likely at least some small part of the blame will lie with Intel. Whether they should have known better (like with the whole IME and predictive threading debacle) remains to be seen.

    2. The Man Who Fell To Earth Silver badge
      Black Helicopters

      Allow one to disable a core

      One band-aid would be to simply stop using a core found to be mercurial: ideally by switching it off, less ideally by having the OS avoid using it.

      1. Cynic_999 Silver badge

        Re: Allow one to disable a core

        But that assumes that an error in a particular core has been detected in the first place. Where errors would have serious consequences, I would say the best policy would be to have at least 2 different cores or CPUs running the same code in parallel, and checking that the results match. Using 3 cores/CPUs would be better and allow the errant device to be detected. Best for each to have their own RAM as well.
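The compare-and-vote scheme described above can be sketched in a few lines. A minimal illustration in Python, where the `Core` class and its injected fault are entirely hypothetical stand-ins for real hardware cores (this is not how any real fleet does it):

```python
import collections

class Core:
    """Toy stand-in for a hardware core; 'faulty' simulates a
    mercurial core that silently mis-computes (demo-only fault injection)."""
    def __init__(self, core_id, faulty=False):
        self.core_id = core_id
        self.faulty = faulty

    def compute(self, x):
        result = x * x
        # A faulty core returns a silently wrong answer.
        return result + 1 if self.faulty else result

def redundant_compute(cores, x):
    """Run the same calculation on every core and majority-vote.
    With two cores you can only detect a mismatch; with three or
    more you can also identify which core is the errant one."""
    results = [c.compute(x) for c in cores]
    winner, count = collections.Counter(results).most_common(1)[0]
    if count <= len(cores) // 2:
        raise RuntimeError(f"no majority among results: {results}")
    suspects = [c.core_id for c, r in zip(cores, results) if r != winner]
    return winner, suspects
```

With `[Core(0), Core(1, faulty=True), Core(2)]` the vote returns the correct square and names core 1 as the suspect. Per the comment, a real deployment would also give each replica its own RAM, so a fault in shared memory cannot fool all voters at once.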

        1. HammerOn1024

          Re: Allow one to disable a core

          So much for my power supplies... and power plants... and power infrastructure in general.

        2. SCP

          Re: Allow one to disable a core

          aka Lockstep processing, with Triple Core Lockstep (TCLS) being something proposed by ARM.

      2. AVR

        Re: Allow one to disable a core

        Assuming that it's not effectively running its own private encryption as described in the article. That might require you to switch it back on for a while.

        Also, apparently even finding out that the core is producing errors can be tricky.

        1. Malcolm 5

          Re: Allow one to disable a core

          That comment about encryption that can only be reversed by the same core intrigued me. Maybe I am thinking too advanced, but is anything other than XOR encryption really so symmetric that a processor bug would hit encryption and decryption "the same"?

          I guess it could be that a keystream calculation was doing the same thing each time and feeding into an XOR with the data.

    3. NoneSuch
      Devil

      Out of the Box Thinking

      The grief is probably caused by native NSA authored microcode that needs an update.

    4. Anonymous Coward
      Anonymous Coward

      There will be a logo and a silly name; just wait and see.

  2. Richard Boyce

    Error detection

    We've long had ECC RAM available, but only really critical tasks have had CPU redundancy for detecting and removing errors. Maybe it's time for that to change. As chips have more and more cores added, perhaps we could usefully use an option to tie cores together in threes to do the same tasks with majority voting to determine the output.

    1. AndrewV

      Re: Error detection

      It would be more efficient to only call in the third as a tiebreaker.
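That pay-as-you-go scheme is easy to sketch: run the calculation on two cores, and only invoke the third when the first two disagree. A hypothetical illustration in Python, where the three callables stand in for the same code pinned to different cores:

```python
def compute_with_tiebreak(primary, secondary, tiebreaker, x):
    """Dual redundancy with an on-demand third vote: cheaper than
    always running three copies, because the tiebreaker only burns
    cycles when the first two results disagree."""
    a, b = primary(x), secondary(x)
    if a == b:
        return a              # common case: no third run needed
    c = tiebreaker(x)         # disagreement: call in the third core
    if c == a or c == b:
        return c              # 2-of-3 majority
    raise RuntimeError(f"no two results agree: {a}, {b}, {c}")
```

The saving comes with the caveat YARR raises below: the tiebreaker itself may be mercurial, and a wrong answer that happens to match one faulty core would win the vote.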

      1. yetanotheraoc

        Re: Error detection

        As in, appeal to the supreme core?

        1. Ben Bonsall

          Re: Error detection

          As in minority report...

          1. juice Silver badge

            Re: Error detection

            >As in minority report...

            I think that "voting" concept has been used in a few places - including, if memory serves, the three "Magi" in Neon Genesis Evangelion.

            https://wiki.evageeks.org/Magi

            There's even a relatively obscure story about a Bolo (giant sentient tanks), in which the AI's multi-core hardware is failing, and it has to bring a human along for the ride while fighting aliens, since there's a risk that it'll end up stuck with an even number of "votes" and will need to ask the human to act as a tie-breaker...

            1. J. Cook Silver badge

              Re: Error detection

              ... rumor has it that the Space Shuttle's early avionics and main computer were something akin to a 5-node cluster working on the same calculations, and the results from at least three of the nodes had to be identical.

              Or something like that.

              1. bombastic bob Silver badge
                Devil

                Re: Error detection

                without revealing [classified information] the concept of "2 out of 3" needed to initiate something, such as [classified information], might even use an analog means of doing so, and pre-dates the space shuttle [and Evangelion] by more than just a few years.

                Definitely a good idea for critical calculations, though.

                1. martinusher Silver badge

                  Re: Error detection

                  Two out of three redundancy is as old as the hills. It can be made a bit more reliable by having different systems arrive at the result -- instead of three (or more) identical boxes you distribute the work among different systems so that the likelihood of an error showing up in more than one system is minimized.

                  The problem with this sort of approach is not just bulk but time -- like any deliberative process you have to achieve a consensus to do anything which inevitably delays the outcome.

              2. General Purpose Bronze badge

                Re: Error detection

                Something like this?

                During time-critical mission phases (i.e., recovery time less than one second), such as boost, reentry, and landing, four of these computers operate as a redundant set, receiving the same input data, performing the same flight-critical computations, and transmitting the same output commands. (The fifth computer performs non-critical computations.) In this mode of operation, comparison of output commands and “voting” on the results in the redundant set provide the basis for efficient detection and identification of two flight-critical computer failures. After two failures, the remaining two computers in the set use comparison and self-test techniques to provide tolerance of a third fault.

      2. YARR

        Re: Error detection

        The problem with a 3rd tiebreaker is that there’s about a 1/n^2 probability of that also being ‘mercurial’ and favouring the corrupt core over the working one.

        1. FeepingCreature

          Re: Error detection

          Well, only if they fail in the same way. Which, given there is only one correct answer but a near-unlimited number of wrong answers, seems quite unlikely.

          1. EnviableOne Silver badge

            Re: Error detection

            When the answer is only 0 or 1, there are numerous ways it can end up wrong or invalid;

            even down to cosmic rays (there are open bugs in Cisco equipment with background radiation and EM interference as known causes).

    2. cyberdemon Silver badge
      Devil

      Re: Error detection

      Nah, they'll just hide the errors under layer upon inscrutable layer of neural network, and a few arithmetic glitches will probably benefit the model as a whole.

      So instead of being a function of its input and training data, and coming to a conclusion like "black person == criminal" it will say something like "bork bork bork, today's unperson of the day is.. Richard Buttleoyce"

      1. SCP

        Re: Error detection

        Perhaps it should be a

        +++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++

        error.

    3. Natalie Gritpants Jr Silver badge

      Re: Error detection

      You can configure some ARM cores this way and crash on disagreement. Doubles your power consumption and chip area though.

      1. John Robson Silver badge

        Re: Error detection

        Depends on what's happening - it sounds like this is an issue with specific cpus/cores occasionally.

        In which case you only need to halve your processor speed periodically throughout life to pick up any discrepancies.

        1. bombastic bob Silver badge
          Devil

          Re: Error detection

          I dunno about half speed... but certainly limit the operating temperature.

          more than likely it's caused by running at higher than average temperatures (that are still below the limit) which cause an increase in hole/electron migration within the gates [from entropy] and they become weakened and occasionally malfunction...

          (at higher temperatures, entropy is higher, and therefore migration as well)

          I'm guessing that these malfunctioning devices had been run at very high temperatures, almost continuously, for a long period of time [years even]. Even though the chip spec allows temperatures to be WAY hotter than they usually run at, it's probably not a good idea to LET this happen in order to save money on cooling systems (or for any other reason related to this).

          On several occasions I've seen overheated devices malfunction [requiring replacement]. In some cases it was due to bad manufacturing practices (an entire run of bad boards with dead CPUs). I would expect that repeated exposure to maximum temperatures over a long period of time would eventually have the same effect.

    4. SCP

      Re: Error detection

      I believe that sort of architecture (multi-core cross comparison) has already been proposed in the ARM triple-core lockstep (TCLS). This is an extension to classic lockstep and offers correction-on-the-fly. [Not sure where they are on realization.]

    5. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      That is a lot of silicon being dedicated to a problem that can be solved with less.

      It's possible, and has been done where I used to work (Sussex uni), to implement error checking in logic gates. A researcher there, around 20 years ago, was generating chip designs for error checking, finding the least number of gates needed and reducing the design sizes current at the time.

      He was using a GA back then to produce the needed layouts, and found many more efficient ones than were in use (he provided them free to use). This could be applied to CPUs, and is for critical systems, but as it uses more silicon it isn't in consumer CPUs: it adds cost for no performance gain.

      1. Anonymous Coward
        Anonymous Coward

        Re: Error detection

        May have been this guy.

        https://users.sussex.ac.uk/~mmg20/index.html

    6. EveryTime

      Re: Error detection

      CPU redundancy has been around almost since the beginning of electronic computing, but it largely disappeared in the early 1990s as caching and asynchronous interrupts made cycle-by-cycle comparison infeasible.

      My expectation is that this will turn out to be another in a long history of misunderstanding faults. It's seeing a specific design error and mistaking it for a general technology limit.

      My first encounter with this was when dynamic RAM was suffering from high fault rates. I read many stories on how the limit of feature size had been reached. The older generation had been reliable, so the speculation was that the new smaller memory capacitors had crossed the threshold where every cosmic ray would flip bits. I completely believed those stories. Then the next round of stories reported that the actual problem was the somewhat radioactive ceramic used for the chip packaging. Switching to a different source of ceramic avoided the problem, and it was a motivation to simply change to less expensive plastic packages.

      The same thing happened repeatedly over the years in supercomputing/HPC. Researchers thought that they spotted disturbing trends in the largest installed systems. What they found was always a specific solvable problem, not a general reliability limit to scaling.

    7. Warm Braw Silver badge

      Re: Error detection

      The approach adopted by Tandem Computers was to duplicate everything -- including memory and persistent storage -- as you can get bus glitches, cache glitches and all sorts of other transient faults in "shared" components which you would not otherwise be able to detect simply from core coupling. But even that doesn't necessarily protect against systematic errors where every instance of (say) the processor makes the same mistake repeatably.

      It's a difficult problem; and don't forget that many peripherals will also have processors in them, so it's not just the main CPU you have to look out for.

    8. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      So maybe it's all an Intel plot to sell three times as much hardware? ...

    9. bombastic bob Silver badge
      Devil

      Re: Error detection

      CPU redundancy may be easier than people want to admit...

      If your CPU has multiple (actual) cores, for "critical" operations you could run two parallel threads. If your threads can be assigned "CPU affinity" such that they don't hop from CPU to CPU as tasks switch around then you can compare the results to make sure they match. If you're REALLY paranoid, you can use more than 2 threads.

      If it's a VM then the hypervisor (or emulator, or whatever) would need to be able to ensure that core to thread affinity is supported.
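A rough sketch of that scheme in Python, for Linux only: `os.sched_setaffinity` is a real call, and when invoked from a worker thread with PID 0 the underlying syscall pins just that thread. The fallback behaviour and the compare-results policy here are assumptions for illustration:

```python
import os
import threading

def _pinned_worker(fn, args, cpu, results, idx):
    """Pin the calling thread to one CPU, then run the computation.
    Best effort: in a container, or on a non-Linux OS, fall back to
    letting the scheduler place the thread wherever it likes."""
    try:
        os.sched_setaffinity(0, {cpu})  # PID 0 = calling thread (Linux)
    except (AttributeError, OSError):
        pass
    results[idx] = fn(*args)

def critical_compute(fn, args, cpus=(0, 1)):
    """Run the same 'critical' operation once per listed CPU and
    insist the results match before trusting them."""
    results = [None] * len(cpus)
    threads = [
        threading.Thread(target=_pinned_worker, args=(fn, args, cpu, results, i))
        for i, cpu in enumerate(cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if len(set(results)) != 1:
        raise RuntimeError(f"core disagreement: {results}")
    return results[0]
```

As the comment says, under a hypervisor this only helps if virtual-CPU-to-physical-core affinity is actually honoured.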

    10. Anonymous Coward
      Anonymous Coward

      Re: Error detection and elimination

      "only really critical tasks have had CPU redundancy for detecting and removing errors. "

      Tandem Nonstop mean anything to you?

      Feed multiple nominally identical computer systems the same set of inputs and if they don't have the same outputs something's gone wrong (massively oversimplified).

      Lockstep at IO level rather than instruction level (how does instruction-level lockstep deal with things like soft errors in cache memory, which can be corrected but are unlikely to occur simultaneously on two or more of the systems being compared?).

      Anyway, it's mostly been done before. Just not by the Intel/Windows world.
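The I/O-level comparison described above can be sketched like this (massively oversimplified, as the commenter says); the `Accumulator` replica is a made-up stand-in for a whole replicated system:

```python
class Accumulator:
    """Toy replica: a deterministic state machine (running total).
    Real Tandem-style systems replicate whole computers, not objects."""
    def __init__(self):
        self.total = 0

    def step(self, event):
        self.total += event
        return self.total

def io_lockstep(replicas, inputs):
    """Feed the same input stream to nominally identical replicas and
    compare their outputs only, not individual instructions. Transient,
    correctable faults inside one replica (e.g. a scrubbed cache bit)
    never surface here unless they actually change an output."""
    outputs = []
    for event in inputs:
        outs = [r.step(event) for r in replicas]
        if len(set(outs)) != 1:
            raise RuntimeError(f"replicas diverged on {event!r}: {outs}")
        outputs.append(outs[0])
    return outputs
```

Because only outputs are compared, the replicas are free to differ in internal timing and internal error correction, which is precisely what makes instruction-level lockstep awkward on modern cached, asynchronous CPUs.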

    11. The Oncoming Scorn Silver badge

      Re: Error detection

      Where's my minority report!

    12. Tom 7 Silver badge

      Re: Error detection

      Could cause more problems than it solves. If all three cores are close to each other on the die and the error is one of the 'field type' (where lots of certain activity in a certain area of the chip causes the problem), then all three cores could fall to the same problem and provide identical incorrect results, giving the illusion that all is OK.

  3. Fruit and Nutcase Silver badge
    Joke

    The Spanish Inquisition

    "we must extract 'confessions' via further testing"

    They were good at extracting confessions. Maybe the Google boffins can learn a few techniques from them.

    1. Neil Barnes Silver badge

      Re: The Spanish Inquisition

      Fear and surprise are our weapons, and, er, being mercurial...

      1. Paul Crawford Silver badge

        Re: The Spanish Inquisition

        Our three methods are fear, surprise, being mercurial. Oh and an almost fanatical devotion to IEEE 754 Standard for Floating-Point Arithmetic!

        Damn! Among our methods are fear...

        1. Alan Brown Silver badge

          Re: The Spanish Inquisition

          as long as you don't round over multiple iterations (long story behind this comment....)

          1. Ken Moorhouse Silver badge

            Re: (long story behind this comment....)

            You can tell us... over multiple iterations, if you like.

            1. Claptrap314 Silver badge

              Re: (long story behind this comment....)

              (Who started out in floating point validation)

              slowly shakes head...

        2. jmch Silver badge

          Re: The Spanish Inquisition

          We'll come in again

    2. Anonymous South African Coward Silver badge

      Re: The Spanish Inquisition

      Aren't a Mother Confessor and a team of Mord-Siths more practical for getting out confessions?

    3. Irony Deficient Bronze badge

      Maybe the Google boffins can learn a few techniques from them.

      Ximénez: Now, old woman — you are accused of heresy on three counts: heresy by thought, heresy by word, heresy by deed, and heresy by action — four counts. Do you confess?

      Wilde: I don’t understand what I’m accused of.

      Ximénez: Ha! Then we’ll make you understand! Biggles! Fetch … the cushions!

  4. fredesmite2

    fault tolerance

    There is little fault tolerance and detection anymore... silent data corruption is more the norm in Intel's mass-produced servers.

  5. Joe W Silver badge
    Coat

    auto-erratic ransomware

    Oh.

    that's erratic

    (sorry....)

    1. Arthur the cat Silver badge

      Re: auto-erratic ransomware

      Just wait a while. All technology gets used for porn.

      1. J. Cook Silver badge
        Paris Hilton

        Re: auto-erratic ransomware

        I think y'all meant auto-erotic.

        I'll be in my bunk.



Biting the hand that feeds IT © 1998–2021