FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …

  1. GloriousVictoryForThePeople

    > "The other half is a mix of false accusations and limited reproducibility."

    Perfect for AI facial recognition workloads in Apple stores then.

  2. Moldskred

    "The mega-corp is currently relying on human-driven core integrity interrogation, [...]"

    That sentence sounds a lot more dystopian than it actually is -- which is a nice change of pace when talking about tech companies.

  3. Ken Moorhouse Silver badge

    Complexity: Another nail in the coffin...

    ...for the cloud.

    Before anyone posts the obvious rebuttal: note this phrase "two of the world's larger CPU stressors, Google and Facebook".

    If your critical business processes are on-prem, the chances are that you will not be stressing your CPUs to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are the CPU time spent crunching it is being shared with other companies' workloads.

    I grew up with the concept of "the clock pulse". If we're pushing synchronous data to the limits (rise-fall time of data wrt clock pulses) then you could arguably get a skew effect. If designers are in denial about that then there are big problems ahead. (Rowhammer is a related problem).

    1. Brewster's Angle Grinder Silver badge

      Not all electrons are made equal...

      To me this sounds like quantum effects. No manufacturing process produces exact replicas; there is going to be subtle variation between chips. I don't know anything about modern chip design and manufacture so can't speculate what it could be. But electron behaviour is just the law of averages. And so whatever these defects are, it means electrons can periodically jump where they shouldn't. The smaller the currents, the fewer party* electrons are needed for this to become significant.

      * The party number is one of the important quantum numbers. It determines how likely an electron is to be an outlier. It's normally represented as a mullet.

      1. Anonymous Coward
        Anonymous Coward

        Re: Not all electrons are made equal...

        It's not quantum effects. Just process variation. These are factored in when closing the design. No two devices will be the same. There is a spread. Some manufacturers will then use binning to grade parts by performance.

        1. Brewster's Angle Grinder Silver badge

          Forbidden gates

          But these are process variations that are being missed by manufacturers and where the chip generally functions as required. Just every once in a while it goes haywire. You could call it fate. You could call it luck. You could call it Karma. You could say it's mercurial or capricious. Or you could suspect some process variation allows tunnelling with low probability, or that some other odd transition or excitation is happening.

          1. Anonymous Coward
            Anonymous Coward

            Re: Forbidden gates

            No, it doesn't really work like that.

            There could be IR drop, crosstalk or local heating issues, but all of these should be analysed during chip implementation and verification.

          2. Pseudononymous Coward
            Boffin

            Re: Forbidden gates

            It's just down to the statistics of very rare events with very large N. If you have a reliable processor with a clock speed of 10^9 hertz that gives you just one error every 10^20 clocks, then you can expect an error every 3000 years or so - say a one-in-five-hundred to one-in-a-thousand chance of seeing a single error during the 3-6 year life of the system. I can live with that for my laptop.

            But if you buy a million of those processors and run them in parallel in data centres then you will see roughly an error every day.
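
            A quick Python sketch of that arithmetic (the clock rate, error rate and fleet size are the hypothetical figures above, not measurements):

            ```python
            # Back-of-the-envelope check of the rare-error arithmetic (hypothetical figures).
            clock_hz = 1e9              # 10^9 clocks per second
            errors_per_clock = 1e-20    # one error every 10^20 clocks
            seconds_per_year = 3600 * 24 * 365

            errors_per_cpu_year = clock_hz * seconds_per_year * errors_per_clock
            print(f"mean time between errors: {1 / errors_per_cpu_year:,.0f} years")  # ~3,200 years

            # Chance of seeing a single error over a laptop's 3-6 year life.
            for years in (3, 6):
                print(f"over {years} years: about a 1 in {1 / (errors_per_cpu_year * years):,.0f} chance")

            # A data centre running a million such CPUs flat out.
            fleet = 1_000_000
            print(f"fleet-wide errors per day: {errors_per_cpu_year * fleet / 365:.1f}")  # ~one a day
            ```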

            1. General Purpose Bronze badge

              Re: Forbidden gates

              The trouble is that those errors aren't evenly spread. Specific individual cores go bad. The chances are against you having one of those in your laptop or one of your on-premises servers, but if you do have one then you may experience a series of mysterious crashes, incorrect calculations and/or data loss, not just one incident.

      2. Roland6 Silver badge

        Re: Not all electrons are made equal...

        >To me this sounds like quantum effects.

        And to me too.

        Also shouldn't rule out cosmic radiation and other particles of interest that normally pass through stuff, yet given a sufficiently large sample will hit something...

    2. Cuddles Silver badge

      Re: Complexity: Another nail in the coffin...

      "If your critical business processes are on-prem, the chances are that you will not be stressing your CPU's to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time."

      I don't think it's anything to do with CPU time, but simply the number of CPUs. As the article notes, it's a few problematic cores per several thousand CPUs, i.e. it's not random failures due to the large amount of use; it's some specific cores that have a problem. But since the problems are rare, only people operating many thousands of them are likely to actually encounter them. So it's a bit misleading to call them "stressors" of CPUs; it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs.

      So it's hard to say if on-prem would be better or not. On the one hand, you're unlikely to have enough CPUs to actually have a problem. But if you get unlucky and you do, the problematic core will be a greater percentage of your computing, and you're unlikely to be able to actually spot it at all. On the other hand, being assigned different CPUs every time you run a task in the cloud makes it almost inevitable that you'll encounter a troublesome core at some point. But it's unlikely to be a persistent problem since you won't have the same core next time, and the companies operating at that scale are able to assign the resources to actually find the problem.

      1. AndrewB57

        Re: Complexity: Another nail in the coffin...

        I **think** that means that there is no difference in the chance of corruption as experienced at a local level.

        Then again, 15% of statisticians will ALWAYS disagree with the other 90%

        1. Will Godfrey Silver badge
          Happy

          Re: Complexity: Another nail in the coffin...

          I see what you did there!

      2. John Brown (no body) Silver badge
        Devil

        Re: Complexity: Another nail in the coffin...

        "happen to use a lot of CPUs"

        What about GPUs, I wonder? Should the big crypto miners be getting concerned about now?

        1. Jon 37 Silver badge

          Re: Complexity: Another nail in the coffin...

          No, because of the way crypto is designed. Any miner who tries to submit a mined block, will have it tested by every other node on the network. If the miner's system glitched, then the block just won't be accepted. And this sounds rare enough that a miner would just shrug and move onto the next block.
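
          To see why, here's a toy proof-of-work sketch in Python - the four-hex-zero difficulty and the `mine`/`verify` helpers are invented for illustration, not how any real chain is implemented. Verification is one cheap hash, so a glitched miner's block is simply thrown away:

          ```python
          import hashlib

          PREFIX = "0000"  # toy difficulty: hash must start with four zero hex digits

          def mine(header: bytes) -> int:
              """Search for a nonce whose hash meets the target (the expensive part)."""
              nonce = 0
              while not hashlib.sha256(header + nonce.to_bytes(8, "big")).hexdigest().startswith(PREFIX):
                  nonce += 1
              return nonce

          def verify(header: bytes, nonce: int) -> bool:
              """Every other node re-checks the work with a single cheap hash."""
              return hashlib.sha256(header + nonce.to_bytes(8, "big")).hexdigest().startswith(PREFIX)

          nonce = mine(b"block header")
          print(verify(b"block header", nonce))      # True: block accepted
          print(verify(b"block header", nonce ^ 1))  # a miscomputed nonce is almost surely rejected
          ```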

      3. stiine Silver badge

        re: Cuddles: Re: Complexity: Another nail in the coffin...

        But that doesn't explain why they had crypto algorithms that would only decrypt on the same CPU and core that they were originally encrypted on.

        1. Jon 37 Silver badge

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          "Stream ciphers", one of the common kinds of encryption algorithm, work by taking a key and generating a long string of pseudo-random numbers from that key. That then gets XOR'd into the data.

          It's the same algorithm to encrypt and to decrypt. (Like how ROT13 is the same algorithm to encrypt and to decrypt, except a lot more secure).

          So it's certainly possible that a core bug results in the specific sequence of instructions in the pseudo-random-number generator part giving the wrong answer. And it's certainly possible that is reproducible, repeating it with the same key gives the same wrong answer each time.

          That would lead to the described behaviour - encrypting on the buggy core gives a different encryption from any other core, so only the buggy core can decrypt it.
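
          A toy Python illustration of that round-trip - the SHA-256 counter-mode keystream and the deterministic "stuck bit" fault are invented for the demo; this is not a real cipher:

          ```python
          import hashlib

          def keystream(key: bytes, n: int, buggy: bool = False) -> bytes:
              """Toy keystream: SHA-256 in counter mode. `buggy` simulates a core that
              deterministically computes one bit of every 64th byte wrongly."""
              out = bytearray()
              counter = 0
              while len(out) < n:
                  out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
                  counter += 1
              ks = out[:n]
              if buggy:
                  for i in range(0, n, 64):
                      ks[i] ^= 0x01  # the same wrong answer every time: reproducible
              return bytes(ks)

          def crypt(key: bytes, data: bytes, buggy: bool = False) -> bytes:
              """Encrypt and decrypt are the same operation: XOR with the keystream."""
              return bytes(a ^ b for a, b in zip(data, keystream(key, len(data), buggy)))

          ct = crypt(b"key", b"accounts data", buggy=True)           # written on the faulty core
          print(crypt(b"key", ct, buggy=True) == b"accounts data")   # True: the faulty core reads it back
          print(crypt(b"key", ct, buggy=False) == b"accounts data")  # False: healthy cores cannot
          ```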

        2. bombastic bob Silver badge
          Devil

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          maybe they need to use an encryption algorithm that isn't susceptible to (virtually) identical math errors during encryption and decryption. Then you could self-check by decrypting the encrypted output and comparing to the original. So long as the errors produce un-decryptable results, you should be fine.

      4. Michael Wojcik Silver badge

        Re: Complexity: Another nail in the coffin...

        "it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs"

        Well, it's also about how much of the time a given CPU (or rather each of its cores) is being used, since that's what gives you a result that might be incorrect. If a company "uses" a million cores but a given core is idle 90% of the time, they'll be much less likely to encounter a fault, obviously.

        So while "stressing" is probably not really an accurate term – it's not like they're using the CPUs outside their documented envelope (AFAIK) – "using more or less constantly" is a relevant qualification.

  4. Smudged
    Terminator

    Evolution of the microchip

    So Google have witnessed chips evolving to produce their own ransomware capability. What next? How far from Skynet are we?

  5. amanfromMars 1 Silver badge

    Just the cost of doing such business.

    The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests.

    If you consider chips as not too dissimilar from the networking of smarter humans, emerging anomalies are much easier to understand and be prepared for and accepted as just being an inherent endemic glitch always testing novel processes and processing there is no prior programming for.

    And what if they are not simply errors but other possibilities available in other realities/times/spaces/virtually augmented places?

    Are we struggling to make machines more like humans when we should be making humans more like machines….. IntelAIgent and CyberIntelAIgent Virtualised Machines?

    Prime Digitization offers Realisable Benefits.

    What is a computer other than a machine which we try to make think like us and/or for us? And what other model, to mimic/mirror could we possibly use, other than our own brain or something else SMARTR imagined?

    And if through Deeper Thought, our Brain makes a Quantum Leap into another Human Understanding such as delivers Enlightened Views, does that mean that we can be and/or are Quantum Computers?

    And is that likely to be a Feared and/or AWEsome Alien Territory?

    1. OJay

      Re: Just the cost of doing such business.

      for once, I was able to follow this train of thought until the end.

      So, what does that make me. Mobile Autonomous quaNtum unit?

      1. amanfromMars 1 Silver badge

        Re: Just the cost of doing such business.

        for once, I was able to follow this train of thought until the end.

        So, what does that make me. Mobile Autonomous quaNtum unit? ..... OJay>

        Gifted is a viable and pleasant thought, OJay, and would not be at all presumptuous. :-)

  6. Def Silver badge
    Coat

    That's mercurial as in unpredictable, not Mercurial as in the version control system of the same name.

    So what you're really saying is the version control system of the same name was aptly named.

  7. Anonymous Coward
    Anonymous Coward

    Once upon a time.....way back in another century......

    ......some of us (dimly) remember the idea of a standard development process:

    1. Requirements (how quaint!!!)

    2. Development

    3. Unit Test

    4. Functional Test

    5. Volume Test (also rather quaint!!)

    6. User Acceptance Test (you know...against item#1)

    .....where #4, #5 and #6 might overlap somewhat in the timeline.

    Another old fashioned idea was to have two (or three) separate installations (DEV, USER, PROD).......

    ......not sure how any of this old fashioned, twentieth century thinking fits in with "agile", "devops", "cloud"....and other "advanced" twenty first century thinking.

    ......but this article certainly makes this AC quite nostalgic for days past!

    1. Ken Moorhouse Silver badge

      Re: 5. Volume Test (also rather quaint!!)

      Every gig I've ever attended they never ever got past 2.

      1. Arthur the cat Silver badge

        Re: 5. Volume Test (also rather quaint!!)

        Every gig I've ever attended they never ever got past 2.

        <voice accent="yorkshire">You were lucky!</voice>

    2. Anonymous South African Coward Silver badge

      Re: Once upon a time.....way back in another century......

      In the Elder Days, when things was Less Rushed, sure, you could take your time with a product, and deliver a product that lived up to its promises.

      Nowadays in these Younger Days everything is rushed to market (RTM) after a vigorous spit 'n polish and sugarcoating session to hide most of Them Nasteh Buggreh Bugs. And nary a peep of said TNBB's either... hoping said TNBB's won't manifest themselves until closer to the End Lifetime of the Product.

      Case in point - MCAS.

      ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

      1. Version 1.0 Silver badge
        Thumb Up

        Re: Once upon a time.....way back in another century......

        I never saw any problems with an 8080, 8085, 8048, or Z80 that I hadn't created myself, and I fixed them as soon as I saw them. Processors used to be completely reliable until the marketing and sales departments started wanting to add "features", which has led to all of today's issues.

      2. Arthur the cat Silver badge

        Re: Once upon a time.....way back in another century......

        ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

        On the other hand, back when, a friend of mine remarked that the Sinclair Scientific calculator was remarkably egalitarian, because if you didn't like the answer it gave you, you just had to squeeze the sides and it would give you a different one.

      3. SCP

        Re: Once upon a time.....way back in another century......

        ASAC wrote:

        "ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPU's still produces vaild results and nary a corruption."

        Is that with or without the pint of milk on top to keep it cool enough?

    3. Primus Secundus Tertius

      Re: Once upon a time.....way back in another century......

      AC has described the ideal case.

      In practice, there were repeats of item 1 between items 2 and 3, 3 and 4, etc. Table-thumping customer managements and toadying contractor sales people.

      (S)He also omits a necessary step between 1 and 2, namely the software design. The requirements stated what was thought to be required - not always a correct piece of analysis. The software design says how you get there in terms of data structures and algorithms. Once software got past transcribing maths into FORTRAN, the SD was essential.

      For CPUs, replace software with microcode. This was even more problematical than orthodox code.

    4. Doctor Syntax Silver badge

      Re: Once upon a time.....way back in another century......

      7. Use in production.

      It's only in 7, and even then only at large scale, that rare, sporadic failures become recognisable. Even if you were lucky enough to catch one at the previous stages you wouldn't be able to reproduce it reliably enough to understand it.

  8. Pascal Monett Silver badge

    "misbehaving cores"

    Is replacing the CPU with another identical one not a good idea, or will the new one start misbehaving in the same way?

    The article states that Google and Facebook report a few cores per several thousand machines. That means that most CPUs are functioning just fine, so rip out the mercurial CPUs and replace them. That should give a chance of solving the immediate issue.

    Of course, then you take the misbehaving CPU and give it a good spanking, euh, put it in a test rig to find out just how it fails.

    1. Blank Reg Silver badge

      Re: "misbehaving cores"

      Replacing CPUs is the easy part. Detecting that a CPU needs replacing because it makes a mistake once every ten thousand hours is the hard part.

      1. Neil Barnes Silver badge

        Re: "misbehaving cores"

        Paraphrasing the jackpot computer in Robert Sheckley's Dimension of Miracles: I'm allowed to make one mistake in ten million and therefore not only am I going to, but I have to.

    2. Ken Moorhouse Silver badge

      Re: "misbehaving cores"

      The question is whether this is at the CPU level, the board level, the box level or the system level. Tolerances* for all of these things give rise to unacceptable possibilities - don't forget at the board/box level you've got power supplies and, hopefully, UPSs attached to those. How highly do these data centres/centers rate these seemingly mundane sub-assemblies, for example? (I'm sure many of us here have had experiences with slightly wayward PSUs).

      *The old-fashioned "limits and fits" is to my mind a better illustration of how components work with each other.

  9. Red Ted Silver badge
    Happy

    SETI saw result corruption too

    The SETI project used to see work units with corrupted results, as they double-checked all results.

    They attributed it to cosmic rays striking the micro and causing a bit flip.

    1. LDS Silver badge
      Alien

      They attributed it to cosmic rays striking the micro and causing a bit flip.

      It was just aliens hiding their presence. But they did it on the checks too.

    2. Anonymous Coward
      Anonymous Coward

      Re: SETI saw result corruption too

      IIRC SETI also noticed that a lot of the corrupted results came from CPUs that had been overclocked.

    3. Wokstation

      Neutrinos!

      They occasionally bump stuff and can flip a bit - we're building more and more surface area of microchip, so it's only natural that neutrino hits would be proportionally more common.

  10. Brewster's Angle Grinder Silver badge

    Poacher turned gamekeeper

    I think they should hire that ransomware core as a cryptographer.

  11. Binraider

    Can we have ECC RAM supported by regular chipsets, please, like we certainly had off the shelf in the late 90s / early 2000s. The sheer quantity of RAM and reduced tolerance to radiation means the probability of bitflips is rather greater today than before.

    Either AMD or Intel could put support back into consumer chipsets as an easy way to get an edge over competitors.

    Regarding CPUs, there's a reason satellite manufacturers are happy using a 20-year-old architecture and manufacturing process at 200nm: lower vulnerability to radiation-induced errors. (And using SRAM rather than DRAM too, for the same reason.) Performance, cost, "tolerable" error. Rather less practical to roll back consumer performance (unless you fancy getting some genuinely efficient software out in circulation).

    1. Claptrap314 Silver badge

      I'm sure that F & G are using ECC RAM already. It's always been out there, but the marginal cost has been enough that (usually) retail consumers avoid it. But I recall it from the '80s.

      1. Roland6 Silver badge

        >retail consumers avoid it.

        Err, I think you'll find it is the manufacturers who avoid the use of ECC-supporting chipsets in consumer products.

        1. Claptrap314 Silver badge

          Retail consumers avoid the cost that manufacturers charge for it. So...sure.

  12. Wolfclaw

    Pointless having a third core deciding on the tie breaker, the losing cores will simply call in the lawyers to overthrow the result and demand a recount.
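
    (Joking aside, the tie-breaker being mocked here is classic triple modular redundancy - a minimal sketch, with a hypothetical `tmr_vote` helper rather than anyone's shipping implementation:)

    ```python
    from collections import Counter

    def tmr_vote(results):
        """Triple modular redundancy: run the same computation on three cores
        and accept the majority answer; a lone dissenter flags a faulty core."""
        answer, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: more than one core misbehaving")
        return answer

    print(tmr_vote([42, 42, 41]))  # 42 - the outvoted core gets no recount
    ```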

  13. dinsdale54

    I have worked for a few hardware companies over the years and every single one has at some point had issues with random errors causing system crashes at rates above design expectations - these were all bit-flip errors.

    In each case the people who noticed first were our biggest customers. In one of these cases, the way they discovered the problem was products from two different companies exhibiting random errors. A quick look at both motherboards showed the same I/O chipset in use. Radioactive contamination in the chip packaging was the root cause.

    You can mitigate these by putting multi-layer parity and ECC on every chip, bus and register with end-to-end checksumming. That will turn silent data corruption into non-silent, but it's also really expensive.
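
    A minimal sketch of the end-to-end idea, with CRC32 standing in for whatever parity/ECC a real design would use (the frame format here is made up for the demo):

    ```python
    import zlib

    def protect(payload: bytes) -> bytes:
        """Append a CRC32 so corruption anywhere along the path is detectable."""
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def check(frame: bytes) -> bytes:
        """Verify the trailing CRC32 before trusting the payload."""
        payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
        if zlib.crc32(payload) != crc:
            raise ValueError("checksum mismatch: corruption detected, no longer silent")
        return payload

    frame = bytearray(protect(b"ledger row 42"))
    frame[3] ^= 0x04              # simulate a single bit flip in flight
    try:
        check(bytes(frame))
    except ValueError as e:
        print(e)                  # corruption caught instead of silently propagating
    ```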

    But at least let's have ECC as standard!

    1. autopoiesis

      Radioactive contamination in the chip packaging - that's intriguing. I take it that wasn't determined during the 'quick look' phase ;)

      Nice find in any case - what was the contaminant, who found it, how long did it take etc?

      1. dinsdale54

        I forget the exact details - this was over 10 years ago - but IIRC systems that had generated these errors were put in a radiation test chamber and their radioactivity measured. Once you have demonstrated there's a problem, it's down to the chipset manufacturer to find the issue. I think it was just low-level contamination in the packaging material that occasionally popped out an alpha particle and could flip a bit.

        The remediation is a massive PITA. I think we were dealing with it for about 2 years from initial high failure rates to having all the faulty systems replaced.

        Over the years I have spent far more of my career dealing with these issues than I would like. I put in a big shift remediating Seagate MOOSE drives that had silent data corruption as well.
