FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …

  1. Blackjack Silver badge

    Somehow this will all end with Intel having part of the blame...

    1. imanidiot Silver badge

      Since some parts of modern x86 and x64 chip designs result directly or indirectly from decisions Intel has made in the past, it's likely at least some small part of the blame will lie with Intel. Whether they should have known better (like with the whole IME and predictive threading debacle) remains to be seen.

    2. The Man Who Fell To Earth Silver badge
      Black Helicopters

      Allow one to disable a core

      One band-aid would be to simply stop using a core found to be mercurial, ideally by switching it off, less ideally by having the OS avoid using it.

      1. Cynic_999

        Re: Allow one to disable a core

        But that assumes that an error in a particular core has been detected in the first place. Where errors would have serious consequences, I would say the best policy would be to have at least 2 different cores or CPUs running the same code in parallel, and checking that the results match. Using 3 cores/CPUs would be better and allow the errant device to be detected. Best for each to have their own RAM as well.

        1. HammerOn1024

          Re: Allow one to disable a core

          So much for my power supplies... and power plants... and power infrastructure in general.

        2. SCP

          Re: Allow one to disable a core

          aka Lockstep processing, with Triple Core Lockstep (TCLS) being something proposed by ARM.

      2. AVR Bronze badge

        Re: Allow one to disable a core

        Assuming that it's not effectively running its own private encryption as described in the article. That might require you to switch it back on for a while.

        Also, apparently even finding out that the core is producing errors can be tricky.

        1. Malcolm 5

          Re: Allow one to disable a core

          That comment about encryption that can only be reversed by the same core intrigued me - maybe I am thinking too advanced, but is anything other than XOR encryption really so symmetric that a processor bug would hit encryption and decryption "the same"?

          I guess it could be that a stream calculation was doing the same thing each time and feeding into an XOR with the data.

    3. NoneSuch Silver badge
      Devil

      Out of the Box Thinking

      The grief is probably caused by native NSA-authored microcode that needs an update.

    4. Anonymous Coward
      Anonymous Coward

      There will be a logo and a silly name; just wait and see.

  2. Richard Boyce

    Error detection

    We've long had ECC RAM available, but only really critical tasks have had CPU redundancy for detecting and removing errors. Maybe it's time for that to change. As chips have more and more cores added, perhaps we could usefully have an option to tie cores together in threes to do the same tasks, with majority voting to determine the output.
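
    For illustration, a minimal software-level sketch of that three-way voting idea, assuming Python and its multiprocessing module; the work() function and the three-copy vote are made-up examples, not anything from the article:

      import multiprocessing as mp
      from collections import Counter

      def majority_vote(results):
          # Return the value at least two of the runs agree on,
          # or fail loudly if there is no quorum.
          value, count = Counter(results).most_common(1)[0]
          if count < 2:
              raise RuntimeError("no two results agree: %r" % (results,))
          return value

      def run_redundantly(func, args, copies=3):
          # Run the same pure function several times, ideally on different
          # cores, and let the majority decide the answer.
          with mp.Pool(processes=copies) as pool:
              results = [pool.apply(func, args) for _ in range(copies)]
          return majority_vote(results)

      def work(x):
          return x * x   # stand-in for the real computation

      if __name__ == "__main__":
          print(run_redundantly(work, (12345,)))   # 152399025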

    1. AndrewV

      Re: Error detection

      It would be more efficient to only call in the third as a tiebreaker.

      1. yetanotheraoc Silver badge

        Re: Error detection

        As in, appeal to the supreme core?

        1. Ben Bonsall

          Re: Error detection

          As in minority report...

          1. juice

            Re: Error detection

            >As in minority report...

            I think that "voting" concept has been used in a few places - including, if memory serves, the three "Magi" in Neon Genesis Evangelion.

            https://wiki.evageeks.org/Magi

            There's even a relatively obscure story about a Bolo (giant sentient tanks), in which the AI's multi-core hardware is failing, and it has to bring a human along for the ride while fighting aliens, since there's a risk that it'll end up stuck with an even number of "votes" and will need to ask the human to act as a tie-breaker...

            1. J. Cook Silver badge

              Re: Error detection

              ... rumor has it that the Space Shuttle's early avionics and main computer were something akin to a 5-node cluster working on the same calculations, and that the results from at least three of the nodes had to be identical.

              Or something like that.

              1. bombastic bob Silver badge
                Devil

                Re: Error detection

                without revealing [classified information] the concept of "2 out of 3" needed to initiate something, such as [classified information], might even use an analog means of doing so, and pre-dates the space shuttle [and Evangelion] by more than just a few years.

                Definitely a good idea for critical calculations, though.

                1. martinusher Silver badge

                  Re: Error detection

                  Two out of three redundancy is as old as the hills. It can be made a bit more reliable by having different systems arrive at the result -- instead of three (or more) identical boxes you distribute the work among different systems so that the likelihood of an error showing up in more than one system is minimized.

                  The problem with this sort of approach is not just bulk but time -- like any deliberative process you have to achieve a consensus to do anything which inevitably delays the outcome.

              2. General Purpose

                Re: Error detection

                Something like this?

                During time-critical mission phases (i.e., recovery time less than one second), such as boost, reentry, and landing, four of these computers operate as a redundant set, receiving the same input data, performing the same flight-critical computations, and transmitting the same output commands. (The fifth computer performs non-critical computations.) In this mode of operation, comparison of output commands and “voting” on the results in the redundant set provide the basis for efficient detection and identification of two flight-critical computer failures. After two failures, the remaining two computers in the set use comparison and self-test techniques to provide tolerance of a third fault.

      2. YARR

        Re: Error detection

        The problem with a 3rd tiebreaker is that there’s about a 1/n^2 probability of that also being ‘mercurial’ and favouring the corrupt core over the working one.

        1. FeepingCreature

          Re: Error detection

          Well, only if they fail in the same way. Which given there is only one correct answer but a near unlimited number of wrong answers, seems quite unlikely.

          1. EnviableOne

            Re: Error detection

            when the answer is only 0 or 1, there are numerous ways the answer can end up wrong or invalid

            even down to cosmic rays (there are open bugs in Cisco equipment with background radiation and EM interference as known causes)

    2. cyberdemon Silver badge
      Devil

      Re: Error detection

      Nah, they'll just hide the errors under layer upon inscrutable layer of neural network, and a few arithmetic glitches will probably benefit the model as a whole.

      So instead of being a function of its input and training data, and coming to a conclusion like "black person == criminal" it will say something like "bork bork bork, today's unperson of the day is.. Richard Buttleoyce"

      1. SCP

        Re: Error detection

        Perhaps it should be a

        +++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++

        error.

    3. Natalie Gritpants Jr

      Re: Error detection

      You can configure some ARM cores this way and crash on disagreement. Doubles your power consumption and chip area though.

      1. John Robson Silver badge

        Re: Error detection

        Depends on what's happening - it sounds like this is an issue with specific CPUs/cores occasionally.

        In which case you only need to halve your processor speed periodically throughout life to pick up any discrepancies.

        1. bombastic bob Silver badge
          Devil

          Re: Error detection

          I dunno about half speed... but certainly limit the operating temperature.

          More than likely it's caused by running at higher-than-average temperatures (that are still below the limit), which causes an increase in hole/electron migration within the gates [from entropy]; they become weakened and occasionally malfunction...

          (at higher temperatures, entropy is higher, and therefore migration as well)

          I'm guessing that these malfunctioning devices had been run at very high temperatures, almost continuously, for a long period of time [years even]. Even though the chip spec allows temperatures to be WAY hotter than they usually run at, it's probably not a good idea to LET this happen in order to save money on cooling systems (or for any other reason related to this).

          On several occasions I've seen overheated devices malfunction [requiring replacement]. In some cases it was due to bad manufacturing practices (an entire run of bad boards with dead CPUs). I would expect that repeated exposure to maximum temperatures over a long period of time would eventually have the same effect.

    4. SCP

      Re: Error detection

      I believe that sort of architecture (multi-core cross comparison) has already been proposed in the ARM triple-core-lockstep (TCLS). This is an extension to classic lockstep and offers correction-on-the-fly. (Not sure where they are on realization.)

    5. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      That is a lot of silicon being dedicated to a problem that can be solved with less.

      It's possible to implement error checking in the logic gates themselves, and it has been done where I used to work, Sussex uni. A researcher there, around 20 years ago, was generating chip designs for error checking, finding the smallest number of gates needed and reducing the then-current design size.

      He was using a GA (genetic algorithm) back then to produce the needed layouts and found many more efficient ones than were in use at the time (and provided them free to use). This could be applied to CPUs, and is for critical systems, but as it uses more silicon it isn't done in consumer CPUs, since that adds cost for no performance gain.

      1. Anonymous Coward
        Anonymous Coward

        Re: Error detection

        May have been this guy.

        https://users.sussex.ac.uk/~mmg20/index.html

    6. EveryTime

      Re: Error detection

      CPU redundancy has been around almost since the beginning of electronic computing, but it largely disappeared in the early 1990s as caching and asynchronous interrupts made cycle-by-cycle comparison infeasible.

      My expectation is that this will turn out to be another in a long history of misunderstanding faults. It's seeing a specific design error and mistaking it for a general technology limit.

      My first encounter with this was when dynamic RAM was suffering from high fault rates. I read many stories on how the limit of feature size had been reached. The older generation had been reliable, so the speculation was that the new, smaller memory capacitors had crossed the threshold where every cosmic ray would flip bits. I completely believed those stories. Then the next round of stories reported that the actual problem was the somewhat radioactive ceramic used for the chip packaging. Switching to a different source of ceramic avoided the problem, and it was a motivation to simply change to less expensive plastic packages.

      The same thing happened repeatedly over the years in supercomputing/HPC. Researchers thought that they spotted disturbing trends in the largest installed systems. What they found was always a specific solvable problem, not a general reliability limit to scaling.

    7. Warm Braw

      Re: Error detection

      The approach adopted by Tandem Computers was to duplicate everything - including memory and persistent storage - as you can get bus glitches, cache glitches and all sorts of other transient faults in "shared" components which you would not otherwise be able to detect simply from core coupling. But even that doesn't necessarily protect against systematic errors where every instance of (say) the processor makes the same mistake repeatably.

      It's a difficult problem: and don't forget that many peripherals will also have processors in them; it's not just the main CPU you have to look out for.

    8. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      So maybe it's all an Intel plot to sell three times as much hardware? ...

    9. bombastic bob Silver badge
      Devil

      Re: Error detection

      CPU redundancy may be easier than people may want to admit...

      If your CPU has multiple (actual) cores, for "critical" operations you could run two parallel threads. If your threads can be assigned "CPU affinity" such that they don't hop from CPU to CPU as tasks switch around then you can compare the results to make sure they match. If you're REALLY paranoid, you can use more than 2 threads.

      If it's a VM then the hypervisor (or emulator, or whatever) would need to be able to ensure that core-to-thread affinity is supported.
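
      A rough sketch of that idea, assuming Linux and Python (os.sched_setaffinity is Linux-only); the core numbers and the work() function are illustrative:

        import multiprocessing as mp
        import os

        def pinned_worker(core, func, args, out_queue):
            # Pin this worker process to one specific core, then run the job.
            os.sched_setaffinity(0, {core})
            out_queue.put((core, func(*args)))

        def compute_on_cores(func, args, cores=(0, 1)):
            # Run the same computation once per listed core and compare results.
            queue = mp.Queue()
            procs = [mp.Process(target=pinned_worker, args=(c, func, args, queue))
                     for c in cores]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            results = dict(queue.get() for _ in cores)
            if len(set(results.values())) != 1:
                raise RuntimeError("cores disagree: %r" % results)
            return results[cores[0]]

        def work(x):
            return sum(i * i for i in range(x))

        if __name__ == "__main__":
            print(compute_on_cores(work, (100000,)))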

    10. Anonymous Coward
      Anonymous Coward

      Re: Error detection and elimination

      "only really critical tasks have had CPU redundancy for detecting and removing errors. "

      Tandem Nonstop mean anything to you?

      Feed multiple nominally identical computer systems the same set of inputs and if they don't have the same outputs something's gone wrong (massively oversimplified).

      Lockstep at I/O level rather than instruction level (how does instruction-level lockstep deal with things like soft errors in cache memory, which can be corrected but are unlikely to occur simultaneously on two or more systems being compared?).

      Anyway, it's mostly been done before. Just not by the Intel/Windows world.

    11. The Oncoming Scorn Silver badge

      Re: Error detection

      Where's my minority report!

    12. Tom 7

      Re: Error detection

      Could cause more problems than it solves. If all three cores are close to each other on the die and the error is one of the 'field' type (where lots of certain activity in a certain area of the chip causes the problem), then all three cores could fall foul of the same problem and provide identical incorrect results, thus giving the illusion all is OK.

  3. Fruit and Nutcase Silver badge
    Joke

    The Spanish Inquisition

    "we must extract 'confessions' via further testing"

    They were good at extracting confessions. Maybe the Google boffins can learn a few techniques from them.

    1. Neil Barnes Silver badge

      Re: The Spanish Inquisition

      Fear and surprise are our weapons, and, er, being mercurial...

      1. Paul Crawford Silver badge

        Re: The Spanish Inquisition

        Our three methods are fear, surprise, being mercurial. Oh and an almost fanatical devotion to IEEE 754 Standard for Floating-Point Arithmetic!

        Damn! Among our methods are fear...

        1. Alan Brown Silver badge

          Re: The Spanish Inquisition

          as long as you don't round over multiple iterations (long story behind this comment....)

          1. Ken Moorhouse Silver badge

            Re: (long story behind this comment....)

            You can tell us... over multiple iterations, if you like.

            1. Claptrap314 Silver badge

              Re: (long story behind this comment....)

              (Who started out in floating point validation)

              slowly shakes head...

        2. jmch Silver badge

          Re: The Spanish Inquisition

          We'll come in again

    2. Anonymous South African Coward Silver badge

      Re: The Spanish Inquisition

      Aren't a Mother Confessor and a team of Mord-Siths more practical for getting out confessions?

    3. Irony Deficient

      Maybe the Google boffins can learn a few techniques from them.

      Ximénez: Now, old woman — you are accused of heresy on three counts: heresy by thought, heresy by word, heresy by deed, and heresy by action — four counts. Do you confess?

      Wilde: I don’t understand what I’m accused of.

      Ximénez: Ha! Then we’ll make you understand! Biggles! Fetch … the cushions!

  4. fredesmite2

    fault tolerance

    There is little in fault tolerance and detection anymore... silent data corruption is more the norm in Intel mass-produced servers.

  5. Joe W Silver badge
    Coat

    auto-erratic ransomware

    Oh.

    that's erratic

    (sorry....)

    1. Arthur the cat Silver badge

      Re: auto-erratic ransomware

      Just wait a while. All technology gets used for porn.

      1. J. Cook Silver badge
        Paris Hilton

        Re: auto-erratic ransomware

        I think y'all meant auto-erotic.

        I'll be in my bunk.

  6. GloriousVictoryForThePeople

    > "The other half is a mix of false accusations and limited reproducibility."

    Perfect for AI facial recognition workloads in Apple stores then.

  7. Moldskred

    "The mega-corp is currently relying on human-driven core integrity interrogation, [...]"

    That sentence sounds a lot more dystopian than it actually is -- which is a nice change of pace when talking about tech companies.

  8. Ken Moorhouse Silver badge

    Complexity: Another nail in the coffin...

    ...for the cloud.

    Before anyone posts the obvious rebuttal: note this phrase "two of the world's larger CPU stressors, Google and Facebook".

    If your critical business processes are on-prem, the chances are that you will not be stressing your CPUs to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time.

    I grew up with the concept of "the clock pulse". If we're pushing synchronous data to the limits (rise-fall time of data wrt clock pulses) then you could arguably get a skew effect. If designers are in denial about that then there are big problems ahead. (Rowhammer is a related problem).

    1. Brewster's Angle Grinder Silver badge

      Not all electrons are made equal...

      To me this sounds like quantum effects. No manufacturing process produces exact replicas; there is going to be subtle variation between chips. I don't know anything about modern chip design and manufacture so can't speculate what it could be. But electron behaviour is just the law of averages. And so whatever these defects are, it means electrons can periodically jump where they shouldn't. The smaller the currents, the fewer party* electrons are needed for this to become significant.

      * The party number is one of the important quantum numbers. It determines how likely an electron is to be an outlier. It's normally represented as a mullet.

      1. Anonymous Coward
        Anonymous Coward

        Re: Not all electrons are made equal...

        It's not quantum effects. Just process variation. These are factored in when closing the design. No two devices will be the same. There is a spread. Some manufacturers will then use binning to grade parts by performance.

        1. Brewster's Angle Grinder Silver badge

          Forbidden gates

          But these are process variations that are being missed by manufacturers and where the chip generally functions as required. Just every once in a while it goes haywire. You could call it fate. You could call it luck. You could call it Karma. You could say it's mercurial or capricious. Or you could suspect some process variation allows tunnelling with low probability, or that some other odd transition or excitation is happening.

          1. Anonymous Coward
            Anonymous Coward

            Re: Forbidden gates

            No it doesn't really work like that.

            But there could be IR drop, crosstalk or local heating issues. But all of these should be analysed during chip implementation and verification.

          2. Anonymous Coward
            Boffin

            Re: Forbidden gates

            It's just down to the statistics of very rare events with very large N. If you have a reliable processor with a clock speed of 10^9 hertz that gives you just one error every 10^20 clocks, then you can expect an error every 3000 years or so, say a one in five hundred or a thousand chance of seeing a single error during the 3-6 year life of the system. I can live with that for my laptop.

            But if you buy a million of those processors and run them in parallel in data centres then you will see roughly an error every day.
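
            Working through that arithmetic as a quick sketch (the one-error-per-10^20-clocks figure is the hypothetical above, not a measured rate):

              clock_hz = 1e9            # 1 GHz
              clocks_per_error = 1e20   # hypothetical reliability figure

              seconds_per_error = clocks_per_error / clock_hz          # 1e11 s
              years_per_error = seconds_per_error / (365 * 24 * 3600)  # ~3,200 years

              # One laptop over a 3-6 year life: roughly a 1-in-700 chance.
              p_single = 4.5 / years_per_error

              # A million such processors running flat out: ~1 error per day.
              fleet_errors_per_day = 1e6 / (years_per_error * 365)

              print(years_per_error, p_single, fleet_errors_per_day)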

            1. General Purpose

              Re: Forbidden gates

              The trouble is that those errors aren't evenly spread. Specific individual cores go bad. The chances are against you having one of those in your laptop or one of your on-premises servers, but if you do have one then you may experience a series of mysterious crashes, incorrect calculations and/or data loss, not just one incident.

      2. Roland6 Silver badge

        Re: Not all electrons are made equal...

        >To me this sounds like quantum effects.

        And to me too.

        Also shouldn't rule out cosmic radiation and other particles of interest that normally pass through stuff, yet given a sufficiently large sample will hit something...

    2. Cuddles

      Re: Complexity: Another nail in the coffin...

      "If your critical business processes are on-prem, the chances are that you will not be stressing your CPU's to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time."

      I don't think it's anything to do with CPU time, but simply the number of CPUs. As the article notes, it's a few problematic cores per several thousand CPUs, i.e. it's not random failures due to the large amount of use, it's some specific cores that have a problem. But since the problems are rare, only people operating many thousands of them are likely to actually encounter them. So it's a bit misleading to call them "stressors" of CPUs; it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs.

      So it's hard to say if on-prem would be better or not. On the one hand, you're unlikely to have enough CPUs to actually have a problem. But if you get unlucky and you do, the problematic core will be a greater percentage of your computing, and you're unlikely to be able to actually spot it at all. On the other hand, being assigned different CPUs every time you run a task in the cloud makes it almost inevitable that you'll encounter a troublesome core at some point. But it's unlikely to be a persistent problem since you won't have the same core next time, and the companies operating at that scale are able to assign the resources to actually find the problem.

      1. AndrewB57

        Re: Complexity: Another nail in the coffin...

        I **think** that means that there is no difference in the chance of corruption as experienced at a local level.

        Then again, 15% of statisticians will ALWAYS disagree with the other 90%

        1. Will Godfrey Silver badge
          Happy

          Re: Complexity: Another nail in the coffin...

          I see what you did there!

      2. John Brown (no body) Silver badge
        Devil

        Re: Complexity: Another nail in the coffin...

        "happen to use a lot of CPUs"

        What about GPUs, I wonder? Should the big crypto miners be getting concerned about now?

        1. Jon 37 Silver badge

          Re: Complexity: Another nail in the coffin...

          No, because of the way crypto is designed. Any miner who tries to submit a mined block will have it tested by every other node on the network. If the miner's system glitched, then the block just won't be accepted. And this sounds rare enough that a miner would just shrug and move on to the next block.

      3. stiine Silver badge

        re: Cuddles: Re: Complexity: Another nail in the coffin...

        But that doesn't explain why they had crypto algorithms that would only decrypt on the same CPU and core that they were originally encrypted on.

        1. Jon 37 Silver badge

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          "Stream ciphers", one of the common kinds of encryption algorithm, work by taking a key and generating a long string of pseudo-random numbers from that key. That then gets XOR'd into the data.

          It's the same algorithm to encrypt and to decrypt. (Like how ROT13 is the same algorithm to encrypt and to decrypt, except a lot more secure).

          So it's certainly possible that a core bug results in the specific sequence of instructions in the pseudo-random-number generator part giving the wrong answer. And it's certainly possible that is reproducible, repeating it with the same key gives the same wrong answer each time.

          That would lead to the described behaviour - encrypting on the buggy core gives a different encryption from any other core, so only the buggy core can decrypt it.
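
          A toy illustration of that point, assuming Python; the "keystream" here is just a seeded PRNG standing in for a real stream cipher, and the only point being made is that encrypting and decrypting execute the identical instruction sequence:

            import random

            def keystream(key, n):
                # Toy keystream: a seeded PRNG, not a real cipher.
                rng = random.Random(key)
                return bytes(rng.randrange(256) for _ in range(n))

            def xor_crypt(key, data):
                # The same function encrypts and decrypts: XOR with the keystream.
                ks = keystream(key, len(data))
                return bytes(a ^ b for a, b in zip(data, ks))

            msg = b"attack at dawn"
            ct = xor_crypt(1234, msg)
            assert xor_crypt(1234, ct) == msg   # decrypting is just encrypting again

            # If a buggy core computed keystream() wrongly but *reproducibly*, its
            # ciphertext would still decrypt fine on that core (the same wrong
            # keystream cancels out) and fail everywhere else - the behaviour
            # described above.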

        2. bombastic bob Silver badge
          Devil

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          maybe they need to use an encryption algorithm that isn't susceptible to (virtually) identical math errors during encryption and decryption. Then you could self-check by decrypting the encrypted output and comparing to the original. So long as the errors produce un-decryptable results, you should be fine.
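
          One hedged sketch of that self-check, in the same toy terms as the example above: rather than switching algorithms, run the round-trip check on a different core, since on the same buggy core a wrong-but-reproducible keystream cancels out and the comparison would pass anyway. The decrypt_on_other_core callable is hypothetical and would be wired up to a pinned worker in practice:

            import random

            def xor_crypt(key, data):
                # Same toy stream cipher as in the sketch above.
                rng = random.Random(key)
                ks = bytes(rng.randrange(256) for _ in range(len(data)))
                return bytes(a ^ b for a, b in zip(data, ks))

            def encrypt_and_cross_check(key, data, decrypt_on_other_core):
                # decrypt_on_other_core(key, ct) should run on a *different* core.
                ct = xor_crypt(key, data)
                if decrypt_on_other_core(key, ct) != data:
                    raise ValueError("round-trip mismatch - quarantine this core")
                return ct

            # Demo only: passing xor_crypt directly shows the API; in practice the
            # second argument would dispatch the decryption to another physical core.
            ct = encrypt_and_cross_check(42, b"important record", xor_crypt)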

      4. Michael Wojcik Silver badge

        Re: Complexity: Another nail in the coffin...

        it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs

        Well, it's also about how much of the time a given CPU (or rather each of its cores) is being used, since that's what gives you a result that might be incorrect. If a company "uses" a million cores but a given core is idle 90% of the time, they'll be much less likely to encounter a fault, obviously.

        So while "stressing" is probably not really an accurate term – it's not like they're using the CPUs outside their documented envelope (AFAIK) – "using more or less constantly" is a relevant qualification.

  9. Smudged
    Terminator

    Evolution of the microchip

    So Google have witnessed chips evolving to produce their own ransomware capability. What next? How far from Skynet are we?

  10. amanfromMars 1 Silver badge

    Just the cost of doing such business.

    The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests.

    If you consider chips as not too dissimilar from the networking of smarter humans, emerging anomalies are much easier to understand and be prepared for and accepted as just being an inherent endemic glitch always testing novel processes and processing there is no prior programming for.

    And what if they are not simply errors but other possibilities available in other realities/times/spaces/virtually augmented places?

    Are we struggling to make machines more like humans when we should be making humans more like machines….. IntelAIgent and CyberIntelAIgent Virtualised Machines?

    Prime Digitization offers Realisable Benefits.

    What is a computer other than a machine which we try to make think like us and/or for us? And what other model, to mimic/mirror could we possibly use, other than our own brain or something else SMARTR imagined?

    And if through Deeper Thought, our Brain makes a Quantum Leap into another Human Understanding such as delivers Enlightened Views, does that mean that we can be and/or are Quantum Computers?

    And is that likely to be a Feared and/or AWEsome Alien Territory?

    1. OJay

      Re: Just the cost of doing such business.

      for once, I was able to follow this train of thought until the end.

      So, what does that make me. Mobile Autonomous quaNtum unit?

      1. amanfromMars 1 Silver badge

        Re: Just the cost of doing such business.

        for once, I was able to follow this train of thought until the end.

        So, what does that make me. Mobile Autonomous quaNtum unit? ..... OJay>

        Gifted is a viable and pleasant thought, OJay, and would not be at all presumptuous. :-)

  11. Anonymous Coward
    Coat

    That's mercurial as in unpredictable, not Mercurial as in the version control system of the same name.

    So what you're really saying is the version control system of the same name was aptly named.

  12. Anonymous Coward
    Anonymous Coward

    Once upon a time.....way back in another century......

    ......some of us (dimly) remember the idea of a standard development process:

    1. Requirements (how quaint!!!)

    2. Development

    3. Unit Test

    4. Functional Test

    5. Volume Test (also rather quaint!!)

    6. User Acceptance Test (you know...against item#1)

    .....where #4, #5 and #6 might overlap somewhat in the timeline.

    Another old fashioned idea was to have two (or three) separate installations (DEV, USER, PROD).......

    ......not sure how any of this old fashioned, twentieth century thinking fits in with "agile", "devops", "cloud"....and other "advanced" twenty first century thinking.

    ......but this article certainly makes this AC quite nostalgic for days past!

    1. Ken Moorhouse Silver badge

      Re: 5. Volume Test (also rather quaint!!)

      Every gig I've ever attended they never ever got past 2.

      1. Arthur the cat Silver badge

        Re: 5. Volume Test (also rather quaint!!)

        Every gig I've ever attended they never ever got past 2.

        <voice accent="yorkshire">You were lucky!</voice>

    2. Anonymous South African Coward Silver badge

      Re: Once upon a time.....way back in another century......

      In the Elder Days, when things was Less Rushed, sure, you could take your time with a product, and deliver a product that lived up to its promises.

      Nowadays in these Younger Days everything is rushed to market (RTM) after a vigorous spit 'n polish and sugarcoating session to hide most of Them Nasteh Buggreh Bugs. And nary a peep of said TNBBs either... hoping said TNBBs won't manifest themselves until closer to the End Lifetime of the Product.

      Case in point - MCAS.

      ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

      1. Version 1.0 Silver badge
        Thumb Up

        Re: Once upon a time.....way back in another century......

        I never saw any problems with an 8080, 8085, 8048, or Z80 that I didn't create myself and fix as soon as I saw the problem. Processors used to be completely reliable until the marketing and sales departments started to want to add "features", which have led to all of today's issues.

      2. Arthur the cat Silver badge

        Re: Once upon a time.....way back in another century......

        ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

        On the other hand, back when, a friend of mine remarked that the Sinclair Scientific calculator was remarkably egalitarian, because if you didn't like the answer it gave you, you just had to squeeze the sides and it would give you a different one.

      3. SCP

        Re: Once upon a time.....way back in another century......

        ASAC wrote:

        "ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPU's still produces vaild results and nary a corruption."

        Is that with or without the pint of milk on top to keep it cool enough?

    3. Primus Secundus Tertius

      Re: Once upon a time.....way back in another century......

      AC has described the ideal case.

      In practice, there were repeats of item 1 between items 2 and 3, 3 and 4, etc. Table-thumping customer managements and toadying contractor sales people.

      (S)He also omits a necessary step between 1 and 2, namely the software design. The requirements stated what was thought to be required - not always a correct piece of analysis. The software design says how you get there in terms of data structures and algorithms. Once software got past transcribing maths into FORTRAN the SD was essential.

      For CPUs, replace software with microcode. This was even more problematical than orthodox code.

    4. Doctor Syntax Silver badge

      Re: Once upon a time.....way back in another century......

      7. Use in production.

      It's only in 7, and even then only at large scale that rare, sporadic failures become recognisable. Even if you were lucky enough to catch one at the previous stages you wouldn't be able to reproduce it reliably enough to understand it.

  13. Pascal Monett Silver badge

    "misbehaving cores"

    Is the solution of replacing the CPU with an identical one not a good idea, or will the new one start misbehaving in the same way?

    The article states that Google and Facebook report a few cores in a thousand. That means that most CPUs are functioning just fine, so rip out the mercurial CPUs and replace them. That should give a chance of solving the immediate issue.

    Of course, then you take the misbehaving CPU and give it a good spanking, euh, put it in a test rig to find out just how it fails.

    1. Blank Reg

      Re: "misbehaving cores"

      Replacing CPUs is the easy part. Detecting that a CPU needs replacing because it makes a mistake once every ten thousand hours is the hard part.

      1. Neil Barnes Silver badge

        Re: "misbehaving cores"

        Paraphrasing the jackpot computer in Robert Sheckley's Dimension of Miracles: I'm allowed to make one mistake in ten million, and therefore not only am I going to, but I have to.

    2. Ken Moorhouse Silver badge

      Re: "misbehaving cores"

      The question is whether this is at the CPU level, the board level, the box level, or the system level. Tolerances* for all of these things give rise to unacceptable possibilities - don't forget at the board/box level you've got power supplies and, hopefully, UPSs attached to those. How highly do these data centres/centers rate these seemingly mundane sub-assemblies, for example? (I'm sure many of us here have had experiences with slightly wayward PSUs.)

      *The old-fashioned "limits and fits" is to my mind a better illustration of how components work with each other.

  14. Red Ted
    Happy

    SETI saw result corruption too

    The SETI project used to see work units with corrupted results and they double checked all results.

    They attributed it to cosmic rays striking the micro and causing a bit flip.

    1. Anonymous Coward
      Alien

      They attributed it to cosmic rays striking the micro and causing a bit flip.

      It was just aliens hiding their presence. But they did it on the checks too.

    2. Anonymous Coward
      Anonymous Coward

      Re: SETI saw result corruption too

      IIRC SETI also noticed that a lot of the corrupted results came from CPUs that had been overclocked.

    3. Wokstation

      Neutrinos!

      They occasionally bump stuff and can flip a bit - we're building more and more surface area of microchip, so it's only natural that neutrino hits would be proportionally more common.

  15. Brewster's Angle Grinder Silver badge

    Poacher turned gamekeeper

    I think they should hire that ransomware core as a cryptographer.

  16. Anonymous Coward
    Anonymous Coward

    Can we have ECC RAM supported by regular chipsets, please? Like we certainly had off the shelf in the late 90s / early 2000s. The sheer quantity of RAM and reduced tolerance to radiation mean the probability of bitflips is rather greater today than before.

    Either AMD or Intel could put support back into consumer chipsets as an easy way to get an edge over competitors.

    Regarding CPUs, there's a reason satellite manufacturers are happy using a 20-year-old architecture and manufacturing process at 200nm. Lower vulnerability to radiation-induced errors. (And using SRAM rather than DRAM too, for the same reason.) Performance, cost, "tolerable" error. Rather less practical to roll back consumer performance (unless you fancy getting some genuinely efficient software out in circulation).

    1. Claptrap314 Silver badge

      I'm sure that F & G are using ECC RAM already. It's always been out there, but the marginal cost has been enough that (usually) retail consumers avoid it. But I recall it from the '80s.

      1. Roland6 Silver badge

        >retail consumers avoid it.

        Err I think you'll find it is the manufacturers who avoid the use of ECC supporting chipsets in consumer products.

        1. Claptrap314 Silver badge

          Retail consumers avoid the cost that manufacturers charge for it. So...sure.

    2. osmarks

      AMD does, I believe, but doesn't officially label it as supported.

  17. Wolfclaw

    Pointless having a third core deciding on the tie breaker, the losing cores will simply call in the lawyers to overthrow the result and demand a recount.

  18. dinsdale54

    I have worked for a few hardware companies over the years and every single one has at some point had issues with random errors causing system crashes at above designed rates - these were all bit-flip errors.

    In each case the people who noticed first were our biggest customers. In one of these cases the way they discovered the problem was products from two different companies exhibiting random errors. A quick look at both motherboards showed the same I/O chipset in use. Radioactive contamination in the chip packaging was the root cause.

    You can mitigate these by putting multi-layer parity and ECC on every chip, bus and register, with end-to-end checksumming. That will turn silent data corruption into non-silent, but it's also really expensive.

    But at least let's have ECC as standard!
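
    For what "end-to-end checksumming" means in its simplest application-level form, a small Python sketch (real systems hang parity/ECC off every bus and register; this only shows the carry-a-checksum-with-the-data principle, and the payload is made up):

      import zlib

      def wrap(payload: bytes) -> bytes:
          # Append a CRC32 so corruption anywhere along the path is detectable.
          return payload + zlib.crc32(payload).to_bytes(4, "big")

      def unwrap(blob: bytes) -> bytes:
          payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
          if zlib.crc32(payload) != crc:
              raise ValueError("silent corruption made loud: CRC mismatch")
          return payload

      blob = wrap(b"ledger row 42: amount=100.00")
      assert unwrap(blob) == b"ledger row 42: amount=100.00"

      tampered = blob[:10] + bytes([blob[10] ^ 0x01]) + blob[11:]   # flip one bit
      try:
          unwrap(tampered)
      except ValueError as err:
          print(err)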

    1. autopoiesis

      Radioactive contamination in the chip packaging - that's intriguing. I take it that wasn't determined during the 'quick look' phase ;)

      Nice find in any case - what was the contaminant, who found it, how long did it take etc?

      1. dinsdale54

        I forget the exact details - this was over 10 years ago - but IIRC systems that had generated these errors were put in a radiation test chamber and radioactivity measured. Once you have demonstrated there's a problem then it's down to the chipset manufacturer to find the issue. I think it was just low level contamination in the packaging material that occasionally popped out an Alpha particle and could flip a bit.

        The remediation is a massive PITA. I think we were dealing with it for about 2 years from initial high failure rates to having all the faulty systems replaced.

        Over the years I have spent far more of my career dealing with these issues than I would like. I put in a big shift remediating Seagate MOOSE drives that had silent data corruption as well.

  19. Ilsa Loving

    Minority report architecture

    The only way to be sure would be to have at least 2 cores doing the same calculation each time. If they disagreed, run the calculation again. Alternatively you could have 3 cores doing the same calculation and if there's one core wrong then majority wins.

    Or we finally move to a completely new technology like maybe optical chips.

  20. steelpillow Silver badge
    Happy

    Collecting rarities

    "Rarely seen [things] crop up frequently"

    I love that so much, I don't want it to be sanity-ised.

    1. David 132 Silver badge

      Re: Collecting rarities

      AKA “Million to one chances crop up nine times out of ten”, as Sir Pterry put it…

  21. Teejay

    Brazil, here we come...

    Tuttle Buttle coming nearer.

    1. bazza Silver badge

      Re: Brazil, here we come...

      How are your ducts?!

  22. bsimon

    cosmic rays

    Well, I used to blame "cosmic rays" for bugs in my code, now I have a new excuse: "mercurial cores" ...

  23. Anonymous Coward
    Anonymous Coward

    Sheer volume of data might also be part of the issue. I'm reminded of a comment from a Reg reader who worked somewhere that processed massive data sets. He said something like "million-to-one shots happen every night for us".

  24. Persona Silver badge

    Networks can misbehave too

    I've seen it happen to network traffic too. With data being sent on a network halfway around the world, occasionally a few bits got changed. The deep analysis showed that interference was hitting part of the route and the network error detection was doing its job, detecting it and getting it resent. Very, very, very rarely the interference corrupted enough bits to pass the network-level error check.

    1. Anonymous Coward
      Anonymous Coward

      Re: Networks can misbehave too

      One remote comms device was really struggling - and the transmission errors generated lots of new crashes in the network controller. The reason was the customer had the comms cable running across the floor of their arc welding workshop.

      A good test of a comms link was to wrap the cable a few times round a hair dryer - then switch it on. No matter how good the CRC - there is always a probability of a particular set of corrupt data passing it.
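
      Roughly what that residual probability looks like in numbers: a corruption pattern beyond a CRC's guaranteed detection length slips through at a rate of about 2^-n for an n-bit CRC. A small sketch, with made-up traffic figures:

        # A random corruption pattern passes an n-bit CRC with probability ~2**-n.
        for bits in (16, 32):
            p_escape = 2.0 ** -bits
            # e.g. a noisy link throwing up a million damaged frames a day:
            escapes_per_day = 1_000_000 * p_escape
            print(f"CRC-{bits}: p(escape) ~ {p_escape:.3g}, "
                  f"~{escapes_per_day:.3g} undetected frames/day")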

  25. Claptrap314 Silver badge

    Been there, paid to do that

    I did microprocessor validation at AMD & IBM for a decade about two decades ago.

    I'm too busy to dig into these papers, but allow me to lay out what this sounds like. AMD never had problems of this sort while I was there (the validation team Dave Bass built was that good--by necessity.)

    Many large customers of both companies found it worthwhile to have their own validation teams. Apple in particular had a validation team that was frankly capable of finding more bugs than the IBM team did in the 750 era. (AMD's customers in the 486 & K5 era would tell them about the bugs that they found in Intel's parts & demand that we match them.)

    Hard bugs are the ones that don't always happen--you can execute the same stream of instructions multiple times & get different results. This is almost certainly not the case for the "ransomware" bug. This rules out a lot of potential issues, including "cosmic rays" and "the Earth's magnetic field". (No BOFHs.)

    The next big question is whether these parts behave like this from the time that they were manufactured, or if they are the result of damage that accumulates during the lifetime of any microprocessor. Variations in the manufacturing process can create either of these. We run tests before the dies are cut to catch the first case. For the latter, we do burn-in tests.

    My first project at IBM was to devise a manufacturing test to catch a bug reported by Nintendo in about 3/1000 parts (AIR) during the 750 era. They wanted to find 75% of the bad parts. I took a bit longer than they wanted to isolate the bug, but my test came out 100% effective.

    My point is that this has always been an issue. Manufacturers exist to make money. Burn-in tests are expensive to create--and even more expensive to run. You can work with your manufacturer about these issues or you can embarrass them. Sounds like F & G are going for the latter.

    Oh, and I'm available. ;)

  26. pwjone1

    Error Checking and modern processor design

    To some degree, undetected errors are to be more expected as chip lithography evolves (14nm to 10nm to 7nm to 6 and 5 or 3nm). There is a history of dynamic errors (crosstalk, XI, and other causes), and the susceptibility to these gets worse as the device geometries get smaller -- just fewer electrons that need to leak. Localized heating also becomes more of a problem the denser you get. Obviously Intel has struggled to get to 10nm, potentially also a factor.

    But generally x86 (and Atom) processor designs have not had much error checking; the design point is that the cores "just work", and as that has gradually become more and more problematical, it may be that Intel/AMD/Apple/etc. will need to revisit their approach. IBM, on higher-end servers (z), generally includes error checking. This is done via various techniques like parity/ECC on internal data paths and caches, predictive error checking (for example, on a state machine, you predict/check the parity or check bits of the next state), and in some cases full redundancy (like laying down two copies of the ALU or cores and comparing the results each cycle). To be optimal, you also need some level of software recovery, and there are varying techniques there, too.

    Added error-checking hardware also has its costs, generally 10-15% and a bit of cycle time, depending on how much you put in and how exactly it is implemented. So in a way, it is not too much of a surprise that Google (and others) have observed "Mercurial" results; any hardware designer would have shrugged and said "What did you expect? You get what you pay for."
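
    As a pointer to what the ECC side of that looks like in miniature, here is a Hamming(7,4) encode/correct sketch in Python. Real datapath ECC is SECDED over much wider words and built in hardware; this only shows how recomputed parity bits (the syndrome) locate a single flipped bit:

      def hamming74_encode(d):
          # d: four data bits. Returns seven bits [p1, p2, d1, p3, d2, d3, d4].
          d1, d2, d3, d4 = d
          p1 = d1 ^ d2 ^ d4
          p2 = d1 ^ d3 ^ d4
          p3 = d2 ^ d3 ^ d4
          return [p1, p2, d1, p3, d2, d3, d4]

      def hamming74_correct(c):
          # Recompute the parities; the syndrome is the position of any single
          # flipped bit (0 means the word checks out).
          s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
          s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
          s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
          syndrome = s1 + 2 * s2 + 4 * s3
          if syndrome:
              c = c[:]
              c[syndrome - 1] ^= 1
          return [c[2], c[4], c[5], c[6]]   # corrected data bits

      word = [1, 0, 1, 1]
      code = hamming74_encode(word)
      code[5] ^= 1                          # simulate a single-bit upset in flight
      assert hamming74_correct(code) == word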

  27. Draco
    Big Brother

    What sort of bloody dystopian Orwellian-tak is this?

    "Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data," said Peter Hochschild.

    How were the machines disciplined? Were they given a warning? Did the machines take and pass the requisite unconscious bias training courses?

    "These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them."

    Did the machines have counsel? Were these accusations proven? Can the machines sue for slander and libel if the accusations are shown to be false?

    -----------

    English is my second language and those statements are truly mind numbing to read. In a certain context, they might be seen as humorous.

    What is wrong with the statements being written something more like:

    "Our adventure began as vigilant production teams increasingly observed machines corrupting data," said Peter Hochschild.

    "These machines were repeatedly observed corrupting multiple different stable well-debugged large-scale applications. Multiple independent teams noted corruptions by these machines even though conventional diagnostics found nothing wrong with them."

    1. amanfromMars 1 Silver badge

      Re: What sort of bloody dystopian Orwellian-tak is this?

      Nice one, Draco. Have a worthy upvote for your contribution to the El Reg Think Tank.

      Transfer that bloody dystopian Orwellian-tak to the many fascist and nationalistic geo-political spheres which mass multi media and terrifying statehoods are responsible for presenting, ...... and denying routinely they be held accountable for ....... and one of the solutions for machines to remedy the situation is to replace and destroy/change and recycle/blitz and burn existing prime established drivers /seditious instruction sets.

      Whether that answer would finally deliver a contribution for retribution and recalibration in any solution to the question that corrupts and perverts and subverts the metadata is certainly worth exploring and exploiting any time the issue veers towards destructively problematical and systemically paralysing and petrifying.

      Some things are just turned plain bad and need to be immediately replaced, old decrepit and exhausted tired for brand spanking new and tested in new spheres of engagement for increased performance and reassuring reliability/guaranteed stability.

      Out with the old and In with the new for a new dawn and welcoming beginning.

  28. Michael Wojcik Silver badge

    ObIT

    That's mercurial as in unpredictable, not Mercurial as in delay lines.

  29. Nightkiller

    Even CPUs are sensitive to Critical Race Theory. At some point you have to expect them to share their lived experience.

  30. Claptrap314 Silver badge

    Buggy processors--that work!

    The University of Michigan made a report (with the above title) around 1998 regarding research that they had done with a self-checking microprocessor. Their design was to have a fully out-of-order processor do the computations and compare each part of the computation against an in-order core that was "led" by the out-of-order design. (The entire in-order core could be formally validated.) When there was a miscompare, the instruction would be re-run through the in-order core without reference to the "leading" of the out-of-order core. In this fashion, bugs in the out-of-order core would be turned into slowdowns.

    In order for the design to function appropriately, the in-order core required a level-0 cache. The result was that overall execution speed actually increased slightly. (AIR, this was because the out-of-order core was permitted to fetch from the L0 cache.)

    The design did not attract much attention at AMD. I assume that was mostly because our performance was so much beyond what our team believe this design could reach.

    Sadly, such a design does nothing to block Spectre-class problems.

    In any event, the final issue is cost. F & G are complaining about how much cost they have to bear.

    1. runt row raggy

      Re: Buggy processors--that work!

      I must be missing something. If checking is required, isn't the timing limited to how fast the slowest (in-order) processor can work?

      1. Claptrap314 Silver badge

        Re: Buggy processors--that work!

        You would think so, wouldn't you?

        The design broke up an instruction into parts--instruction fetch, operand fetch, result computation, result store. (It's been >20 years--I might have this wrong.) The in-order core executed these four stages in parallel. It could do this because of the preliminary work of the out-of-order processor. The out-of-order core might take 15 cycles to do all four steps, but the in-order core does it in one--in no small part due to that L0. The in-order core was being drafted by the out-of-order core to the point that it could manage a higher IPC than the out-of-order core--as long as the data was available, which it often was not, of course.

  31. bazza Silver badge

    Sigh...

    I think there's a few people who need to read up on the works of Claude Shannon and Benoit Mandelbrot...

  32. DS999 Silver badge

    How do they know this is new?

    They only found it after a lot of investigation ruled out other causes. It may have been true in years past but no one had enough CPUs running the same code in the same place for long enough that they could tease out the root cause.

    So linking it to today's advanced processes may point us in the wrong direction, unless we can say for sure this wasn't happening with Pentium Pros and PA-RISC 8000s 25 years ago.

    I assume they have sent some of the suspect CPUs to Intel to take an electron microscope to the cores that exhibit the problem, so they can try to determine whether it is some type of manufacturing variation, "wear", or something no one could prepare for, like a one-in-a-septillion neutrino collision with a nucleus changing the electrical characteristics of a single transistor by just enough that an edge-condition error affecting that transistor becomes a one-in-a-quadrillion chance.

    If they did, and Intel figures out why those cores go bad, will it ever become public? Or will Google and Intel treat it as a "competitive advantage" over others?

    1. SCP

      Re: How do they know this is new?

      I can't recall a case related to CPUs, but there were definitely cases like pattern sensitive RAM; resolved when the root cause was identified and design tools modified to avoid the issue.

      The "good old days" were not as perfect as our rose tinted glasses might lead us to recall.

      1. Claptrap314 Silver badge

        Re: How do they know this is new?

        We had a case of a power signal coupling a high bit in an address line leading out of the L1 in the 750. Stopped shipping product to Apple for a bit. Nasty, NASTY bug.

        I don't recall exactly what the source of the manufacturing defect was on the Nintendo bug, but it only affected certain cells in the L2. Once you knew which ones to hit, it was easy to target them. Until I worked it out, though... Uggh.

    2. bazza Silver badge

      Re: How do they know this is new?

      Silicon chips do indeed wear. There's a phenomenon I've heard termed "electron wind" which causes the atoms of the element used to dope the silicon (which is what makes a junction) to be moved across that junction. Eventually they disperse throughout the silicon and then there's no junction at all.

      This is all related to current, temperature and time. More of any of these makes the wearing effect faster.

      Combine the slowness with which that happens, and the effects of noise, temperature and voltage margins on whether a junction is operating as desired, and I reckon you can get effects where apparently odd behaviour can be quasi-stable.

      1. Claptrap314 Silver badge
        Happy

        Re: How do they know this is new?

        I deliberately avoided the term "hole migration" because it tends to cause people's heads to explode, but yeah.

        And not just quasi-stable. The effects of hole migration can be VERY predictable. Eventually, the processor becomes inert!

  33. Anonymous Coward
    Anonymous Coward

    what data got corrupted, exactly?

    While the issue of rogue cores is certainly important, since they could possibly do bad things to useful stuff (e.g., healthcare records, first-response systems), I wonder if this will pop up later in a shareholders meeting about why click-throughs (or whatever they use to price advert space) are down. "Um, it was, ahh, data corruption, due to, ahm, processor cores, that's it."

  34. Bitsminer Silver badge

    Reminds of silent disk corruption a few years ago

    Google Peter Kelemen at CERN; he was part of a team that identified a high rate of disk data corruption amongst the thousands of rotating disk drives at CERN. This was back in 2007.

    Among the root causes, the disks were a bit "mercurial" about writing data into the correct sector. Sometimes it got written to the wrong cylinder, track, and block. That kind of corruption, even on a RAID5 set, results in a write-loss of correct data that ultimately can invalidate a complete file system.

    Reasoning is as follows: write a sector to the wrong place (data), and write the matching RAID-5 parity to the right place. Later, read the data back and get a RAID-5 parity error. To fix the error, rewrite the (valid, but old) data back in place because the parity is a mismatch. Meanwhile, the correct data at the wrong place lives on. When that gets read, the parity error is detected and the original (and valid) data is rewritten. The net net of this: loss of the written sector. If this is file system metadata, it can break the file system integrity.
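
    For reference, the parity arithmetic that reasoning rests on, as a minimal Python sketch: RAID-5 parity is the XOR of the data blocks in a stripe, so the array can spot an inconsistency and rebuild one missing block, but it cannot tell which copy of the story is right when a write has landed in the wrong place. Block contents here are made up:

      def xor_blocks(*blocks):
          # RAID-5 parity: byte-wise XOR of all blocks in the stripe.
          out = bytearray(len(blocks[0]))
          for b in blocks:
              for i, byte in enumerate(b):
                  out[i] ^= byte
          return bytes(out)

      d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
      parity = xor_blocks(d0, d1, d2)

      # Rebuild a lost block from the survivors plus parity:
      assert xor_blocks(d1, d2, parity) == d0

      # A mismatch says the stripe is inconsistent, but not whether the data
      # block, the parity block, or a misdirected write is the liar.
      assert xor_blocks(b"AXAA", d1, d2) != parity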

    1. bazza Silver badge

      Re: Reminds of silent disk corruption a few years ago

      Yes I remember that.

      It was for reasons like this that Sun developed the ZFS file system. It has error checking and correction up and down the file system, designed to probably give error free operations over exabyte filesystems.

      Modern storage devices are close to the point where, if you read the whole device twice, you will not get the same bits returned: at least one will be wrong.

  35. captain veg Silver badge

    so...

    ... we've already got quantum computing, and no one noticed?

    -A.

    1. LateAgain

      Re: so...

      We may, or may not, have noticed.

    2. Claptrap314 Silver badge
      Trollface

      Re: so...

      You do know what a transistor is, correct?

  36. Throatwarbler Mangrove Silver badge
    FAIL

    For shame!

    No one has noticed the opportunity for a new entry in the BOFH Excuse Calendar?

    1. Claptrap314 Silver badge

      Re: For shame!

      Just another page for the existing one, my friend.

  37. itzman

    well it's obvious

    that as a data one gets represented by, e.g., fewer and fewer electrons, statistically the odd data one will fall below the threshold for being a one and become a zero, and vice versa.

    or a stray cosmic ray will flip a flop.

    or enough electrons will tunnel their way to freedom....

  38. rcxb Silver badge

    Higher datacenter temperatures contributing?

    One has to wonder if the sauna-like temperatures Google and Facebook are increasingly running their datacenters at are contributing to the increased rate of CPU-core glitches.

    They may be monitoring CPU temperatures to ensure they don't exceed the spec-sheet maximums, but no real-world device has a vertical cliff drop-off, and the more extreme the conditions it operates in, the sooner some kind of failure can be expected. The speedometer in my car goes significantly into the triple digits, but I wouldn't be shocked if driving it like a race car resulted in mechanical problems rather sooner in its life-cycle.

    Similarly, high temperatures are frequently used to simulate years of ageing with various equipment.
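
    For what it's worth, the usual back-of-the-envelope for "heat simulates ageing" is an Arrhenius acceleration factor. A rough sketch, with the activation energy and temperatures picked purely for illustration:

    import math

    # Arrhenius acceleration factor: AF = exp((Ea/k) * (1/T_use - 1/T_stress))
    # All numbers here are assumptions, not figures from the article or this thread.
    k_eV = 8.617e-5                      # Boltzmann constant, eV/K
    Ea = 0.7                             # assumed activation energy, eV
    T_use = 55 + 273.15                  # normal junction temperature, K
    T_stress = 105 + 273.15              # accelerated-test temperature, K

    af = math.exp((Ea / k_eV) * (1 / T_use - 1 / T_stress))
    print(round(af, 1))                  # ~26x: a week of stress stands in for roughly half a year of use

    Which cuts both ways: if production machines run hot all the time, they are effectively sitting in a mild accelerated-ageing test of their own.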

    1. Claptrap314 Silver badge

      Re: Higher datacenter temperatures contributing?

      I was never privileged to tour our datacenters, but I am HIGHLY confident that G is careful to run the chips in-spec. When you've got that many millions of processors in the barn, a 1% failure rate is E.X.P.E.N.S.I.V.E.

      Now, for decades, that spec has been a curve and not a point (i.e., don't run over this speed at this temperature, or that speed at that temperature). This means that "in spec" is a bit broader than the naive approach might guess.

      They also have temperature monitors to trigger a shutdown if the temp spikes too high for too long. They test these monitors.

  39. Anonymous Coward
    Anonymous Coward

    Um. Clockspeed

    Surprised no one has mentioned clock speed.

    As a core ages it will struggle to maintain high clocks and turbo speeds.

    For a core that was marginal when new but passed initial validation, it's not surprising to see it start to behave unpredictably as it gets older. You see it all the time in overclocked systems, but normally the CPU is running an OS, so it'll BSOD on a driver before it starts to make serious mistakes in an app. For a lone core running compute in a server, it's not surprising that it would start to misstep, and that would align with their findings.

    Identify mercurial cores and drop the clocks by 10%; it's not rocket science.
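
    On Linux at least, "drop the clocks by 10%" is more or less a one-liner against cpufreq's sysfs interface. A minimal sketch (the core number is hypothetical; it needs root, and it assumes the active governor honours scaling_max_freq):

    # Cap one suspect core's maximum frequency roughly 10% below its current ceiling.
    core = 7                                                  # hypothetical mercurial core
    path = f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_max_freq"

    with open(path) as f:
        current_khz = int(f.read().strip())

    with open(path, "w") as f:
        f.write(str(int(current_khz * 0.9)))                  # lower the ceiling by ~10%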

    1. Ken Moorhouse Silver badge

      Re: Um. Clockspeed. Surprised no one has mentioned

      I have mentioned both Clock Skew and the interaction of tolerances ("Limits and Fits") between components/sub-assemblies in this topic earlier.

      ===

      The problem with Voting systems is that the integrity of Clock systems has to be complete. The Clock for all elements of the Voting system has to be such that there is no chance that the results from one element are received outside of the clock period to ensure they are included in this "tick's" vote. If included in the next "tick's" vote then not only does it affect the result for this "tick", but the next "tick" too, which is susceptible to a deleterious cascade effect. I'm assuming that it is prudent to have three separate teams of developers, with no shared libraries, for a 2 in 3 voting system to eliminate the effect of common-mode design principles, which might fail to fault on errors.

      If applying Voting systems to an asynchronous system, such as TCP/IP messaging (where out-of-band packet responses are integral to the design of the system), how do you set time-outs? If they are set too strict then you get the deleterious snowball effect, bringing down the whole system. Too slack and you might just as well use legacy technology.
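
      For what a 2-of-3 vote with a hard per-tick deadline might look like in software, here's a minimal sketch (the function names, the deadline value and the threading approach are illustrative assumptions, not a description of any real safety-rated system):

      from collections import Counter
      from concurrent.futures import ThreadPoolExecutor

      def vote(replicas, args, deadline_s=0.01):
          """Run three independently developed implementations and take the majority.

          A replica that misses the deadline, or raises, counts as a dissenter; if no
          two replicas agree in time, the whole tick is declared faulty rather than
          letting a late answer leak into the next tick's vote.
          """
          with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
              futures = [pool.submit(fn, *args) for fn in replicas]
              results = []
              for future in futures:
                  try:
                      results.append(future.result(timeout=deadline_s))
                  except Exception:              # timed out or crashed replica
                      results.append(None)
          # For brevity, each replica gets its own deadline and pool shutdown still
          # waits for stragglers; a real system would enforce one shared tick deadline.
          tally = Counter(r for r in results if r is not None)
          if not tally:
              raise RuntimeError("no replica answered within the deadline")
          answer, agreeing = tally.most_common(1)[0]
          if agreeing < 2:
              raise RuntimeError("no 2-of-3 agreement this tick")
          return answer

      # e.g. vote([team_a_impl, team_b_impl, team_c_impl], (sensor_reading,))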

  40. Snowy Silver badge
    Coat

    Google and Facebook designed CPUs

    Google and Facebook have started to use CPUs that they designed themselves. Are these the ones that are producing errors?

    1. Claptrap314 Silver badge

      Re: Google and Facebook designed CPUs

      As far as I can tell, those really are more like SoCs and/or GPUs. And, no, they are complaining about someone else's work.

  41. They call me Mr Nick
    FAIL

    Floating Point Fault

    Many years ago while at University I heard of an interesting fault.

    A physicist had run the same program a few days apart but had got different results. And had noticed.

    Upon investigation it transpired that the floating point unit of the ICL mainframe was giving incorrect answers in the 4th decimal place.

    This was duly fixed. But I wondered at the time how many interesting discoveries in physics were actually undiscovered hardware failures.

    1. Ken Moorhouse Silver badge

      Re: Floating Point Fault. had got different results

      That makes it nondeterministic, which is subtly different from giving incorrect (yet consistent) answers in the 4th decimal place.

      Maybe the RAM needed to be flushed each time the physicist's program was run. Perhaps the physicist was not explicitly initialising variables before use.

      1. Doctor Syntax Silver badge

        Re: Floating Point Fault. had got different results

        "Perhaps the physicist was not explicitly initialising variables before use."

        FORTRAN -

        SOMEWARIABLE = 0

        Later on

        SOMEVARIABLE = SOMEVARIABLE + X

        1. Ken Moorhouse Silver badge

          Re: SOMEWARIABLE = 0

          One of the reasons I like Delphi so much is that this cannot easily happen without ignoring compilation warnings, especially if you keep global declarations to the absolute minimum.

        2. Anonymous Coward
          Anonymous Coward

          Re: Floating Point Fault. had got different results

          Python, for all its usefulness, has weak type checking, and that's a major bugbear. Especially when you have a couple of thousand variable names to work with.
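
          The same trap as the FORTRAN example above is easy to reproduce; a tiny sketch (variable names invented for illustration):

          # A typo'd name silently creates a brand-new variable; nothing fails until
          # the results look wrong. Linters (pylint/flake8) or an IDE will usually flag it.
          somevariable = 0.0
          for x in (1.0, 2.0, 3.0):
              somewariable = somevariable + x   # oops: accumulates into a new, unused name
          print(somevariable)                   # still 0.0; the sums went nowhere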

    2. Claptrap314 Silver badge

      Re: Floating Point Fault

      This is why science involves others reproducing your work.

  42. John Savard

    Somebody thought of this before

    Of course, IBM has been designing its mainframe CPUs with built-in error detection all along. And it still does, even now that they're on microchips.

    1. Claptrap314 Silver badge

      Re: Somebody thought of this before

      That's mostly a different issue. But yeah, some of those server parts have more silicon to detect problems (not just errors) than is in the cores.

  43. Paddy
    Pint

    You get what you paid for.

    Doctor, doctor, it hurts if I use this core!

    Then use one of your millions of others?

    But I need all that I buy!

    Then buy safer, ASIL D, automotive spec chips.

    But they cost more!

    So you're cheap? NEXT!

    Chip lifetime bathtub curves are statistical in nature. When you run that many CPUs, occasional failures are to be expected; and failures don't need to be reproducible.

  44. yogidude

    An oldie but a goodie

    I am Pentium of Borg.

    You will be approximated

    1. Claptrap314 Silver badge

      Re: An oldie but a goodie

      The Pentium Bug was a design error. The microcode was programmed with a 0 in a table that was supposed to be something else. That's a completely different issue from this discussion.

      That chip did exactly what it was designed to do. (Some) of these do not.

      1. yogidude

        Re: An oldie but a goodie

        The original bug (the moth famously taped into the log book of Harvard's Mark II relay computer) was also not part of the design, but it gave its name to what we now refer to generically as a flaw in the operation of a device or its software, whatever the cause. That said, since posting the above I realised I omitted a line from the original (early 90s) joke.

        I am Pentium of Borg.

        Division is futile.

        You will be approximated.

  45. Piro

    CPU lockstep processing

    Somewhere, the old greybeard engineers who developed systems like NonStop are shaking their heads: finally, we've hit a scenario in which having multiple CPUs running in lockstep would solve the issue!

    At least 3, so you can eliminate the "bad" core, generate an alarm, and keep on running.

    But no, everyone thought that specialist systems with all kinds of error checking and redundancy were wild overkill, and that everything could be achieved with lots of commodity-level hardware.

    1. Charles 9

      Re: CPU lockstep processing

      Then we get the scenario where two of them agree, but on the wrong answer...

      1. Ken Moorhouse Silver badge

        Re: Then we get the scenario when two of them agree but on the wrong answer...

        Yes. Seen this with a well-known accounts package. One error on the audit trail. Ran the data checker which assumed that the erroneous transaction was correct, and everything else was wrong.

    2. Claptrap314 Silver badge

      Re: CPU lockstep processing

      You ever try designing a board like that? Triple the buses, 10x the fun! (And by fun, I mean electrical interference between buses.) Such a solution would be REALLY expensive.

  46. Anonymous Coward
    Anonymous Coward

    I had a 40 year career diagnosing apparently random IT problems that people said "couldn't happen". Electrical noise; physical noise; static; "in spec" voltages that weren't tight enough in all cases; cosmic particles; the sun shining. English Electric had to ban salted peanuts from the vending machines because the circuit board production line workers liked them.

    Murphy's Law always applies to any window of opportunity: "If anything can go wrong - it will go wrong". Plus the Sod's Law rider "..at the worst possible time".

    Back in the 1960s my boss had a pragmatic approach about apparent mainframe bugs "Once is not a problem - twice and it becomes a problem".

  47. daveyeager@gmail.com

    Isn't this just another one of those things IBM knew about decades ago? I'm pretty sure they built all these redundancies into their mainframes to catch exactly these types of very rare hardware malfunctions. And now the new kids on the block are like "look what we've discovered".

    Things I want to know: are they becoming more common because more cores mean higher odds that one will go haywire? Or is it because the latest manufacturing nodes are less reliable? Are per-transistor defect rates actually increasing? How much more prevalent are these errors compared to the past, for the same number of cores or servers? Lots of unanswered questions in this article.

  48. anonymous boring coward Silver badge

    "One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."

    Of course Skynet would want us to think it's just accidental...

  49. Jason Hindle Silver badge

    I see this ending only one way

    A corrupted processor core and your computer attempting to use your smart home (or pacemaker) to murder you in the middle of the night.

  50. Anonymous Coward
    Anonymous Coward

    L3 cache ECC errors

    We've had a mystery case of some (brand new) Intel i7 NUCs suffering correctable and non-correctable level-3-cache ECC errors (giving a Machine Check Error) in the past 18 months.

    Only happens when they're heavily stressed throwing (video) data around.

    Seems to be hardware specific - you get a "bad batch" which are problematic, showing up errors every few hours, or every few tens of hours.

    The same software runs for hundreds or thousands of hours on other hardware specimens with no issue.

    Related? I don't know.

    1. Claptrap314 Silver badge

      Re: L3 cache ECC errors

      This does sound similar. Seriously, it might be worth a sit-down to talk through, even though it's been 15 years since this was my job. I don't know if you are big enough to merit an account rep with Intel or not, but if you are, be sure to complain--those parts are defective, and need to be replaced by Intel. (And Intel _should_ be pretty interested in getting their hands on some failing examples.)

      1. This post has been deleted by its author

      2. Anonymous Coward
        Anonymous Coward

        Re: L3 cache ECC errors

        We tried talking to Intel but initially they gave us the run-around ("we don't support Linux", etc). (We probably buy a few hundred per year - but growing exponentially.)

        As far as I know we're still effectively "screening" new NUCs by running our code for 48-72 hours, and any that fault in that time are put to one side and not sent to customers. Any that report faults (via our telemetry) in the field will be swapped out at the earliest opportunity.

        Apparently 6 months or so after we started seeing the problem we did get some traction from Intel, sent them a NUC to analyse, and a few months later a BIOS update was issued - which fixed the problem on the NUCs we knew to be flaky at the time. Intel's BIOS release note cryptically says "Fixed issues with Machine Check Errors / Reboot while running game engine".

        I understand we've seen some occasional MCEs since the new BIOS on some specimens, although they may have different root cause...

        1. Claptrap314 Silver badge

          Re: L3 cache ECC errors

          Ouch, ouch, and OUCH!

          If you are seeing a steady problem while ordering less than a thousand CPUs, then this is a HUGE deal. A 1/1000 escape should stop the line. Seriously, talk to your management about going to the press with this.

          BECAUSE you are way, way too small for Intel to care about. And..Intel's parts are everywhere. This isn't going to just affect gamers & miners.

          As to what you can do in the mean time, a couple of things come to mind immediately, sorry if you are already going there.

          1) Double errors happen roughly at the square of the rate of single errors. Screen on corrected bits over a certain level, don't wait for the MCE. (A rough sketch of that kind of screen follows this list.)

          2) Check that you are actually running the parts in-spec. I know this sounds insane, but if your workload drives the temperature of the core too high, then you are running it out of spec. Of course, manufacturers do significant work to predict what the highest workload-driven effect can be, but don't trust it. Also, read the specs on the temperature sensors on the part very carefully. Running the core at X temp is not the same as having the sensors report X temp.

          3) It sounds like it would be worthwhile to spend some time reducing your burn-in run. (I come from the manufacturer's side of things, so the economics are WAY different than I'm used to, but still...)

          a) Part of why I wanted you to check the temp spec so carefully is that if you are sure that you are in spec, you might be able to run at a slightly higher ambient temp & still be in spec. Why would you want to do this? Because the fails will happen faster if you do.

          b) Try to identify which part(s) of your workload are triggering the fails, and just run that part over & over. I had test code that could trigger the 750 Medal of Honor bug after 8-10 hours. Eventually, I got it to fail in 1 second.

          c) Try to see if there is a commonality to the memory locations that fail. As I mentioned with the Nintendo bug for the 750, it might be possible to target just a handful of cache lines to activate the failure.
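
          Point 1 above could be as simple as a threshold on the corrected-error count from the burn-in telemetry. A rough sketch (the threshold and the numbers in the example are invented, not from this thread):

          # Quarantine units on their corrected-ECC rate instead of waiting for an
          # uncorrectable Machine Check Error, on the reasoning that double-bit errors
          # scale roughly as the square of the single-bit rate.
          CORRECTED_PER_HOUR_LIMIT = 5.0        # arbitrary screening threshold

          def should_quarantine(corrected_errors: int, hours_run: float) -> bool:
              """True if a unit's corrected-error rate suggests uncorrectables are coming."""
              if hours_run <= 0:
                  return False
              return corrected_errors / hours_run > CORRECTED_PER_HOUR_LIMIT

          # e.g. a unit that logged 400 corrected L3 ECC events over a 48-hour burn-in:
          print(should_quarantine(400, 48))     # True -> set it aside, don't ship it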

  51. BPontius

    Mercurial, so basically predictable human behaviour. Imagine the problems that will crop up if/when quantum computers reach that sort of complexity: computing errors within a qubit holding multiple values before its superposition collapses, plus the possibility of quantum entanglement between qubits.
