FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …

  1. Ilsa Loving

    Minority report architecture

    The only way to be sure would be to have at least 2 cores doing the same calculation each time. If they disagree, run the calculation again. Alternatively, you could have 3 cores doing the same calculation, and if one core is wrong then the majority wins.

    Or we finally move to a completely new technology like maybe optical chips.
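
    A minimal sketch of the voting idea in Python -- illustrative only, not any real core-scheduling API:

    from collections import Counter

    def majority_vote(results):
        # results: one value per core that ran the same calculation
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority - rerun the calculation on all cores")
        return value

    print(majority_vote([5, 5, 4]))   # -> 5: the odd core out is simply outvoted
    # with only two cores you can detect a disagreement but not arbitrate it,
    # hence the "run it again" fallback described above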

  2. steelpillow Silver badge
    Happy

    Collecting rarities

    "Rarely seen [things] crop up frequently"

    I love that so much, I don't want it to be sanity-ised.

    1. David 132 Silver badge

      Re: Collecting rarities

      AKA “Million to one chances crop up nine times out of ten”, as Sir Pterry put it…

  3. Teejay

    Brazil, here we come...

    Tuttle Buttle coming nearer.

    1. bazza Silver badge

      Re: Brazil, here we come...

      How are your ducts?!

  4. bsimon

    cosmic rays

    Well, I used to blame "cosmic rays" for bugs in my code, now I have a new excuse: "mercurial cores" ...

  5. Anonymous Coward
    Anonymous Coward

    Sheer volume of data might also be part of the issue. I'm reminded of a comment from a Reg reader who worked somewhere that processed massive data sets. He said something like "million-to-one shots happen every night for us".

  6. Persona Silver badge

    Networks can misbehave too

    I've seen it happen to network traffic too. With data being sent halfway around the world, occasionally a few bits got changed. Deep analysis showed that interference was hitting part of the route and the network error detection was doing its job: detecting the corruption and getting the data resent. Very, very rarely the interference corrupted enough bits to pass the network-level error check.

    1. Anonymous Coward
      Anonymous Coward

      Re: Networks can misbehave too

      One remote comms device was really struggling - and the transmission errors generated lots of new crashes in the network controller. The reason was the customer had the comms cable running across the floor of their arc welding workshop.

      A good test of a comms link was to wrap the cable a few times round a hair dryer - then switch it on. No matter how good the CRC - there is always a probability of a particular set of corrupt data passing it.
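
      A quick way to see that last point, using nothing beyond Python's zlib.crc32: a birthday-paradox search finds two different payloads with the same CRC-32, so a sufficiently unlucky burst of corruption could turn one into the other and sail straight past the check.

      import os, zlib

      # brute-force two distinct 8-byte payloads that share a CRC-32
      # (roughly 2^16 random tries, by the birthday paradox)
      seen = {}
      while True:
          msg = os.urandom(8)
          crc = zlib.crc32(msg)
          if crc in seen and seen[crc] != msg:
              print(seen[crc].hex(), msg.hex(), "-> same CRC", hex(crc))
              break
          seen[crc] = msg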

  7. Claptrap314 Silver badge

    Been there, paid to do that

    I did microprocessor validation at AMD & IBM for a decade about two decades ago.

    I'm too busy to dig into these papers, but allow me to lay out what this sounds like. AMD never had problems of this sort while I was there (the validation team Dave Bass built was that good--by necessity.)

    Many large customers of both companies found it worthwhile to have their own validation teams. Apple in particular had a validation team that was frankly capable of finding more bugs than the IBM team did in the 750 era. (AMD's customers in the 486 & K5 era would tell them about the bugs that they found in Intel's parts & demand that we match them.)

    Hard bugs are the ones that don't always happen--you can execute the same stream of instructions multiple times & get different results. This is almost certainly not the case for the "ransomware" bug. This rules out a lot of potential issues, including "cosmic rays" and "the Earth's magnetic field". (No BOFHs.)

    The next big question is whether these parts have behaved like this from the time they were manufactured, or whether this is the result of damage that accumulates during the lifetime of any microprocessor. Variations in the manufacturing process can create either of these. We run tests before the dies are cut to catch the first case. For the latter, we do burn-in tests.

    My first project at IBM was to devise a manufacturing test to catch a bug reported by Nintendo in about 3/1000 parts (AIR) during the 750 era. They wanted to find 75% of the bad parts. I took a bit longer than they wanted to isolate the bug, but my test came out 100% effective.

    My point is that this has always been an issue. Manufacturers exist to make money. Burn-in tests are expensive to create--and even more expensive to run. You can work with your manufacturer on these issues or you can embarrass them. Sounds like F & G are going for the latter.

    Oh, and I'm available. ;)

  8. pwjone1

    Error Checking and modern processor design

    To some degree, undetected errors are to be expected more and more as chip lithography evolves (14nm to 10nm to 7nm to 6 and 5 or 3nm). There is a history of dynamic errors (crosstalk, XI, and other causes), and the susceptibility to these gets worse as the device geometries get smaller -- there are just fewer electrons that need to leak. Localized heating also becomes more of a problem the denser you get. Obviously Intel has struggled to get to 10nm, potentially also a factor.

    But generally x86 (and Atom) processor designs have not had much error checking; the design point is that the cores "just work", and as that has gradually become more and more problematical, it may be that Intel/AMD/Apple/etc. will need to revisit their approach. IBM, on higher-end servers (z), generally includes error checking. This is done via various techniques like parity/ECC on internal data paths and caches, predictive error checking (for example, on a state machine, you predict/check the parity or check bits of the next state), and in some cases full redundancy (like laying down two copies of the ALU or cores and comparing the results each cycle). To be optimal, you also need some level of software recovery, and there are varying techniques there, too.

    Added error-checking hardware also has its costs, generally 10-15% plus a bit of cycle time, depending on how much you put in and how exactly it is implemented. So in a way it is not too much of a surprise that Google (and others) have observed "mercurial" results; any hardware designer would have shrugged and said "What did you expect? You get what you pay for."
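
    A toy software analogy of the "two copies compared each cycle" approach described above - nothing here is IBM's actual z implementation, and the names are illustrative:

    import operator

    def alu_copy_a(op, x, y):
        return op(x, y)

    def alu_copy_b(op, x, y):
        return op(x, y)            # physically separate copy of the same logic

    def checked_execute(op, x, y):
        a = alu_copy_a(op, x, y)
        b = alu_copy_b(op, x, y)
        if a != b:                 # lockstep compare every "cycle"
            raise RuntimeError("ALU miscompare - flag the core and retry or fail over")
        return a

    print(checked_execute(operator.add, 2, 3))   # -> 5; a mercurial core would be
                                                 # caught rather than silently
                                                 # returning a wrong sum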

  9. Draco
    Big Brother

    What sort of bloody dystopian Orwellian-tak is this?

    "Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data," said Peter Hochschild.

    How were the machines disciplined? Were they given a warning? Did the machines take and pass the requisite unconscious bias training courses?

    "These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them."

    Did the machines have counsel? Were these accusations proven? Can the machines sue for slander and libel if the accusations are shown to be false?

    -----------

    English is my second language and those statements are truly mind numbing to read. In a certain context, they might be seen as humorous.

    What is wrong with the statements being written something more like:

    "Our adventure began as vigilant production teams increasingly observed machines corrupting data," said Peter Hochschild.

    "These machines were repeatedly observed corrupting multiple different stable well-debugged large-scale applications. Multiple independent teams noted corruptions by these machines even though conventional diagnostics found nothing wrong with them."

    1. amanfromMars 1 Silver badge

      Re: What sort of bloody dystopian Orwellian-tak is this?

      Nice one, Draco. Have a worthy upvote for your contribution to the El Reg Think Tank.

      Transfer that bloody dystopian Orwellian-tak to the many fascist and nationalistic geo-political spheres which mass multi media and terrifying statehoods are responsible for presenting, ...... and denying routinely they be held accountable for ....... and one of the solutions for machines to remedy the situation is to replace and destroy/change and recycle/blitz and burn existing prime established drivers /seditious instruction sets.

      Whether that answer would finally deliver a contribution for retribution and recalibration in any solution to the question that corrupts and perverts and subverts the metadata is certainly worth exploring and exploiting any time the issue veers towards destructively problematical and systemically paralysing and petrifying.

      Some things are just turned plain bad and need to be immediately replaced, old decrepit and exhausted tired for brand spanking new and tested in new spheres of engagement for increased performance and reassuring reliability/guaranteed stability.

      Out with the old and In with the new for a new dawn and welcoming beginning.

  10. Michael Wojcik Silver badge

    ObIT

    That's mercurial as in unpredictable, not Mercurial as in delay lines.

  11. Nightkiller

    Even CPUs are sensitive to Critical Race Theory. At some point you have to expect them to share their lived experience.

  12. Claptrap314 Silver badge

    Buggy processors--that work!

    The University of Michigan made a report (with the above title) around 1998 regarding research that they had done with a self-checking microprocessor. Their design was to have a fully out-of-order processor do the computations and compare each part of the computation against an in-order core that was "led" by the out-of-order design. (The entire in-order core could be formally validated.) When there was a miscompare, the instruction would be re-run through the in-order core without reference to the "leading" of the out-of-order core. In this fashion, bugs in the out-of-order core would be turned into slowdowns.

    In order for the design to function appropriately, the in-order core required a level-0 cache. The result was that overall execution speed actually increased slightly. (AIR, this was because the out-of-order core was permitted to fetch from the L0 cache.)

    The design did not attract much attention at AMD. I assume that was mostly because our performance was so far beyond what our team believed this design could reach.

    Sadly, such a design does nothing to block Spectre-class problems.

    In any event, the final issue is cost. F & G are complaining about how much cost they have to bear.

    1. runt row raggy

      Re: Buggy processors--that work!

      i must be missing something. if checking is required, isn't the timing limited to how fast the slowest (in-order) processor can work?

      1. Claptrap314 Silver badge

        Re: Buggy processors--that work!

        You would think so, wouldn't you?

        The design broke up an instruction into parts--instruction fetch, operand fetch, result computation, result store. (It's been >20 years--I might have this wrong.) The in-order core executed these four stages in parallel. It could do this because of the preliminary work of the out-of-order processor. The out-of-order core might take 15 cycles to do all four steps, but the in-order core does it in one--in no small part due to that L0. The in-order core was being drafted by the out-of-order core to the point that it could manage a higher IPC than the out-of-order core--as long as the data was available, which it often was not, of course.
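
        A toy software model of that arrangement - only the "a miscompare becomes a slowdown, never a wrong answer" property is modelled, and the function names are illustrative rather than anything from the actual design:

        def fast_ooo_result(op, x, y):
            # stand-in for the big out-of-order core: fast, but possibly buggy
            return op(x, y)

        def simple_checker_result(op, x, y):
            # stand-in for the small in-order core that can be formally verified
            return op(x, y)

        def execute(op, x, y):
            speculative = fast_ooo_result(op, x, y)      # runs ahead, warms the L0
            checked = simple_checker_result(op, x, y)    # drafts behind and verifies
            if checked != speculative:
                # discard the speculative result and take the slower, verified answer
                return checked
            return speculative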

  13. bazza Silver badge

    Sigh...

    I think there's a few people who need to read up on the works of Claude Shannon and Benoit Mandelbrot...

  14. DS999 Silver badge

    How do they know this is new?

    They only found it after a lot of investigation ruled out other causes. It may have been true in years past but no one had enough CPUs running the same code in the same place for long enough that they could tease out the root cause.

    So linking it to today's advanced processes may point us in the wrong direction, unless we can say for sure this wasn't happening with Pentium Pros and PA-RISC 8000s 25 years ago.

    I assume they have sent some of the suspect CPUs to Intel to put an electron microscope on the cores that exhibit the problem, to try to determine whether it is some type of manufacturing variation, "wear", or something no one could prepare for - like a one-in-a-septillion neutrino collision with a nucleus changing the electrical characteristics of a single transistor by just enough that an edge-condition error affecting that transistor becomes a one-in-a-quadrillion chance.

    If they did, and Intel figures out why those cores go bad, will it ever become public? Or will Google and Intel treat it as a "competitive advantage" over others?

    1. SCP

      Re: How do they know this is new?

      I can't recall a case related to CPUs, but there were definitely cases like pattern sensitive RAM; resolved when the root cause was identified and design tools modified to avoid the issue.

      The "good old days" were not as perfect as our rose tinted glasses might lead us to recall.

      1. Claptrap314 Silver badge

        Re: How do they know this is new?

        We had a case of a power signal coupling a high bit in an address line leading out of the L1 in the 750. Stopped shipping product to Apple for a bit. Nasty, NASTY bug.

        I don't recall exactly what the source of the manufacturing defect was on the Nintendo bug, but it only affected certain cells in the L2. Once you knew which ones to hit, it was easy to target them. Until I worked it out, though... Uggh.

    2. bazza Silver badge

      Re: How do they know this is new?

      Silicon chips do indeed wear. There's a phenomenon I've heard termed "electron wind" which causes the atoms of the element used to dope the silicon (which is what makes a junction) to be moved across that junction. Eventually they disperse throughout the silicon and then there's no junction at all.

      This is all related to current, temperature and time. More of any of these makes the wearing effect faster.

      Combine the slowness with which that happens, and the effects of noise, temperature and voltage margins on whether a junction is operating as desired, and I reckon you can get effects where apparently odd behaviour can be quasi stable.
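
      For the current/temperature/time relationship, the usual engineering shorthand for electromigration wear-out is Black's equation; a hedged sketch, with placeholder constants rather than figures for any real process:

      import math

      K_BOLTZMANN_EV = 8.617e-5      # Boltzmann constant in eV/K

      def mttf_black(current_density, temp_k, a=1.0, n=2.0, ea_ev=0.7):
          # Black's equation: median time to failure from electromigration;
          # a, n and ea_ev are process-dependent fitting constants (placeholders)
          return a * current_density ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))

      # more current or more heat -> shorter life: 20 degrees hotter and 50% more
      # current density shortens the expected lifetime several-fold here
      print(mttf_black(1.0, 350.0) / mttf_black(1.5, 370.0))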

      1. Claptrap314 Silver badge
        Happy

        Re: How do they know this is new?

        I deliberately avoided the term "hole migration" because it tends to cause people's heads to explode, but yeah.

        And not just quasi-stable. The effects of hole migration can be VERY predictable. Eventually, the processor becomes inert!

  15. Anonymous Coward
    Anonymous Coward

    what data got corrupted, exactly?

    While the issue of rogue cores is certainly important, since they could possibly do bad things to useful stuff (e.g., healthcare records, first-response systems), I wonder if this will pop up later in a shareholders meeting about why click-throughs (or whatever they use to price advert space) are down. "Um, it was, ahh, data corruption, due to, ahm, processor cores, that's it."

  16. Bitsminer Bronze badge

    Reminds of silent disk corruption a few years ago

    Google Peter Kelemen at CERN; he was part of a team that identified a high rate of disk data corruption amongst the thousands of rotating disk drives at CERN. This was back in 2007.

    Among the root causes, the disks were a bit "mercurial" about writing data into the correct sector. Sometimes it got written to the wrong cylinder, track, and block. That kind of corruption, even on a RAID5 set, results in a write-loss of correct data that ultimately can invalidate a complete file system.

    Reasoning is as follows: write a sector to the wrong place (data), and write the matching RAID-5 parity to the right place. Later, read the data back and get a RAID-5 parity error. To fix the error, rewrite the (valid, but old) data back in place because the parity is a mismatch. Meanwhile, the correct data at the wrong place lives on. When that gets read, the parity error is detected and the original (and valid) data is rewritten. The net net of this: loss of the written sector. If this is file system metadata, it can break the file system integrity.
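
    The first half of that failure mode is easy to mimic; this toy (no real RAID code, names illustrative) just shows how a misdirected write loses the new data without any I/O error ever being reported:

    # a tiny "disk" of four logical blocks
    disk = {lba: f"old-{lba}" for lba in range(4)}

    def misdirected_write(lba, data, actually_lands_at):
        # the drive acknowledges the write but puts the data at the wrong LBA
        disk[actually_lands_at] = data

    misdirected_write(1, "new-1", actually_lands_at=3)

    print(disk[1])   # -> old-1 : the intended sector silently keeps its stale data
    print(disk[3])   # -> new-1 : an unrelated sector has been clobbered
    # RAID-5 parity now disagrees in two places, and as described above the
    # "repair" can simply re-assert the stale data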

    1. bazza Silver badge

      Re: Reminds of silent disk corruption a few years ago

      Yes I remember that.

      It was for reasons like this that Sun developed the ZFS file system. It has error checking and correction up and down the file system, designed to give, with high probability, error-free operation over exabyte filesystems.

      Modern storage devices are getting close to the point where, if you read the whole device twice, you will not get the same bits back both times; at least one will be wrong.
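
      Rough arithmetic behind that claim, assuming the commonly quoted consumer-drive spec of about one unrecoverable read error per 1e14 bits (the drive size and spec value are assumptions, not figures from the article):

      BITS_PER_TB = 8e12
      DRIVE_TB = 16                    # a typical large drive today
      BITS_PER_URE = 1e14              # spec-sheet unrecoverable-read-error rate

      expected_errors_per_full_read = DRIVE_TB * BITS_PER_TB / BITS_PER_URE
      print(expected_errors_per_full_read)   # ~1.3 expected errors per full pass,
                                             # so two full passes will likely differ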

  17. captain veg

    so...

    ... we've already got quantum computing, and no one noticed?

    -A.

    1. LateAgain

      Re: so...

      We may, or may not, have noticed.

    2. Claptrap314 Silver badge
      Trollface

      Re: so...

      You do know what a transistor is, correct?

  18. Throatwarbler Mangrove Silver badge
    FAIL

    For shame!

    No one has noticed the opportunity for a new entry in the BOFH Excuse Calendar?

    1. Claptrap314 Silver badge

      Re: For shame!

      Just another page for the existing one, my friend.

  19. itzman

    well it's obvious

    that as a data one gets represented by, e.g., fewer and fewer electrons, statistically the odd data one will fall below the threshold for being a one, and become a zero, and vice versa.

    or a stray cosmic ray will flip a flop.

    or enough electrons will tunnel their way to freedom....

  20. rcxb Silver badge

    Higher datacenter temperatures contributing?

    One has to wonder if the sauna-like temperatures Google and Facebook are increasingly running their datacenters at are contributing to the increased rate of CPU-core glitches.

    They may be monitoring CPU temperatures to ensure they don't exceed the spec-sheet maximums, but any real-world device doesn't have a vertical cliff dropoff, and the more extreme the conditions it operates in, the sooner some kind of failure can be expected. The speedometer in my car goes significantly into the triple digits, but I wouldn't be shocked if driving it like a race car resulted in mechanical problems rather sooner in its life-cycle.

    Similarly, high temperatures are frequently used to simulate years of ageing with various equipment.
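
    That last point is usually quantified with the Arrhenius acceleration factor used in burn-in planning; a sketch with an assumed activation energy, not figures from Google or Facebook:

    import math

    K_BOLTZMANN_EV = 8.617e-5     # Boltzmann constant in eV/K

    def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev=0.7):
        # how much faster thermally activated failure mechanisms run at the stress
        # temperature than at the normal operating temperature
        t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
        return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

    # with this assumed 0.7 eV activation energy, running at 85C instead of 45C
    # ages the part roughly 17x faster
    print(arrhenius_acceleration(45, 85))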

    1. Claptrap314 Silver badge

      Re: Higher datacenter temperatures contributing?

      I was never privileged to tour our datacenters, but I am HIGHLY confident that G is careful to run the chips in-spec. When you've got that many millions of processors in the barn, a 1% failure rate is E.X.P.E.N.S.I.V.E.

      Now, for decades that spec has been a curve and not a point (i.e. don't run over this speed at this temperature, or that speed at that temperature). This means that "in spec" is a bit broader than the naive approach might guess.

      They also have temperature monitors to trigger a shutdown if the temp spikes too high for too long. They test these monitors.

  21. Anonymous Coward
    Anonymous Coward

    Um. Clockspeed

    Surprised no one has mentioned

    As a core ages it will struggle to maintain high clocks and turbo speeds

    For a core that was marginal when new but passed initial validation, it's not surprising to see it start to behave unpredictably as it gets older. You see it all the time in overclocked systems, but normally the CPU is running an OS so it'll BSOD on a driver before it starts to actually make serious mistakes in an app. For a lone core running compute in a server, it's not surprising that it'll start to misstep, which would align with their findings.

    Identify mercurial cores and drop the clocks by 10%, not rocket science.

    1. Ken Moorhouse Silver badge

      Re: Um. Clockspeed. Surprised no one has mentioned

      I have mentioned both Clock Skew and the interaction of tolerances ("Limits and Fits") between components/sub-assemblies in this topic earlier.

      ===

      The problem with Voting systems is that the integrity of Clock systems has to be complete. The Clock for all elements of the Voting system has to be such that there is no chance that the results from one element are received outside of the clock period to ensure they are included in this "tick's" vote. If included in the next "tick's" vote then not only does it affect the result for this "tick", but the next "tick" too, which is susceptible to a deleterious cascade effect. I'm assuming that it is prudent to have three separate teams of developers, with no shared libraries, for a 2 in 3 voting system to eliminate the effect of common-mode design principles, which might fail to fault on errors.

      If applying Voting systems to an asynchronous system, such as TCP/IP messaging (where out-of-band packet responses are integral to the design of the system), how do you set time-outs? If they are set too strict then you get the deleterious snowball effect, bringing down the whole system. Too slack and you might just as well use legacy technology.
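
      One way to picture the timeout dilemma in code (purely illustrative, no real voting framework): vote only over the answers that arrive within the tick, and treat a missing quorum as a retry.

      from collections import Counter
      from concurrent.futures import ThreadPoolExecutor, wait

      def vote_within_tick(calc, args, tick_seconds):
          pool = ThreadPoolExecutor(max_workers=3)
          futures = [pool.submit(calc, *args) for _ in range(3)]  # ideally 3 independent implementations
          done, _ = wait(futures, timeout=tick_seconds)
          pool.shutdown(wait=False, cancel_futures=True)          # late answers miss this tick (Python 3.9+)
          results = [f.result() for f in done]
          if not results:
              raise TimeoutError("tick too strict: no votes arrived in time")
          value, count = Counter(results).most_common(1)[0]
          if count < 2:
              raise RuntimeError("no 2-of-3 quorum this tick - rerun before the next tick")
          return value

      print(vote_within_tick(lambda a, b: a + b, (2, 3), tick_seconds=0.5))   # -> 5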

  22. Snowy
    Coat

    Google and Facebook designed CPUs

    Google and Facebook have started to use CPUs that they designed themselves - are these the ones that are producing errors?

    1. Claptrap314 Silver badge

      Re: Google and Facebook designed CPUs

      As far as I can tell, those really are more like SOCs and/or GPUs. And, no, they are complaining about someone else's work.

  23. They call me Mr Nick
    FAIL

    Floating Point Fault

    Many years ago while at University I heard of an interesting fault.

    A physicist had run the same program a few days apart but had got different results. And had noticed.

    Upon investigation it transpired that the floating point unit of the ICL mainframe was giving incorrect answers in the 4th decimal place.

    This was duly fixed. But I wondered at the time how many interesting discoveries in physics were actually undiscovered hardware failures.

    1. Ken Moorhouse Silver badge

      Re: Floating Point Fault. had got different results

      That makes it nondeterministic, which is subtly different from giving incorrect (yet consistent) answers in the 4th decimal place.

      Maybe the RAM needed to be flushed each time the physicist's program was run. Perhaps the physicist was not explicitly initialising variables before use.

      1. Doctor Syntax Silver badge

        Re: Floating Point Fault. had got different results

        "Perhaps the physicist was not explicitly initialising variables before use."

        FORTRAN -

        SOMEWARIABLE = 0

        Later on

        SOMEVARIABLE = SOMEVARIABLE + X

        1. Ken Moorhouse Silver badge

          Re: SOMEWARIABLE = 0

          One of the reasons I like Delphi so much is that this cannot easily happen without ignoring compilation warnings - that, and keeping global declarations to the absolute minimum.

        2. Binraider

          Re: Floating Point Fault. had got different results

          Python, for all of its usefulness - weak type checking is a major bugbear, especially when you have a couple of thousand variable names to work with.

    2. Claptrap314 Silver badge

      Re: Floating Point Fault

      This is why science involves others reproducing your work.

  24. John Savard Silver badge

    Somebody thought of this before

    Of course, though, IBM has been designing their mainframe CPUs with built-in error detection all along. And it still does, even though they're on microchips.

    1. Claptrap314 Silver badge

      Re: Somebody thought of this before

      That's mostly a different issue. But yeah, some of those server parts have more silicon to detect problems (not just errors) than is in the cores.

  25. Paddy
    Pint

    You get what you paid for.

    Doctor, doctor, it hurts if I use this core!

    Then use one of your millions of others?

    But I need all that I buy!

    Then buy safer, ASIL D, automotive spec chips.

    But they cost more!

    So you're cheap? NEXT!

    Chip lifetime bathtub curves are statistical in nature. When you run that many CPUs, their occasional failures might be expected; and failures don't need to be reproducible.

  26. yogidude

    An oldie but a goodie

    I am Pentium of Borg.

    You will be approximated

    1. Claptrap314 Silver badge

      Re: An oldie but a goodie

      The Pentium Bug was a design error. The microcode was programmed with a 0 in a table that was supposed to be something else. That's a completely different issue from this discussion.

      That chip did exactly what it was designed to do. (Some) of these do not.
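
      For reference, the classic quick check people used for that bug - on a correct FPU the result is (essentially) zero, while flawed Pentiums reportedly returned 256 because of the bad table entries:

      x, y = 4195835.0, 3145727.0
      print(x - (x / y) * y)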
