FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …

  1. Blackjack Silver badge

    Somehow this will all end with Intel having part of the blame...

    1. imanidiot Silver badge

      Since some parts of modern x86 and x64 chip designs result directly or indirectly from decisions Intel has made in the past, it's likely at least some small part of the blame will lie with Intel. Whether they should have known better (like with the whole IME and predictive threading debacle) remains to be seen.

    2. The Man Who Fell To Earth Silver badge
      Black Helicopters

      Allow one to disable a core

      One band-aid would be to simply stop using a core found to be mercurial, ideally by switching it off, less ideally by having the OS avoid using it.

      1. Cynic_999

        Re: Allow one to disable a core

        But that assumes that an error in a particular core has been detected in the first place. Where errors would have serious consequences, I would say the best policy would be to have at least 2 different cores or CPUs running the same code in parallel, and checking that the results match. Using 3 cores/CPUs would be better and allow the errant device to be detected. Best for each to have their own RAM as well.
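
        Roughly what that looks like in software - a minimal Python sketch, where compute() is just a placeholder workload and real lockstep hardware would compare per cycle rather than per result:

        from collections import Counter
        from multiprocessing import Pool

        def compute(x):
            # placeholder for the calculation whose result we don't fully trust
            return (x * x + 1) % 97

        def vote(results):
            # keep the value a strict majority of replicas agree on
            value, count = Counter(results).most_common(1)[0]
            if count * 2 <= len(results):
                raise RuntimeError("no majority - replicas disagree")
            return value

        def redundant_compute(x, replicas=3):
            # run the same work several times (ideally on different cores or
            # hosts, each with its own RAM) and only accept a majority result
            with Pool(replicas) as pool:
                results = pool.map(compute, [x] * replicas)
            return vote(results)

        if __name__ == "__main__":
            print(redundant_compute(12345))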

        1. HammerOn1024

          Re: Allow one to disable a core

          So much for my power supplies... and power plants... and power infrastructure in general.

        2. SCP

          Re: Allow one to disable a core

          aka Lockstep processing, with Triple Core Lockstep (TCLS) being something proposed by ARM.

      2. AVR

        Re: Allow one to disable a core

        Assuming that it's not effectively running its own private encryption as described in the article. That might require you to switch it back on for a while.

        Also, apparently even finding out that the core is producing errors can be tricky.

        1. Malcolm 5

          Re: Allow one to disable a core

          that comment about encryption that can only be reversed by the same core intrigued me - maybe I am thinking too advanced, but is anything other than XOR encryption really so symmetric that a processor bug would hit encryption and decryption "the same"?

          I guess it could be that a stream calculation was doing the same thing both times and feeding into an XOR with the data

    3. NoneSuch Silver badge
      Devil

      Out of the Box Thinking

      The grief is probably caused by native NSA authored microcode that needs an update.

    4. Anonymous Coward
      Anonymous Coward

      There will be a logo and a silly name; just wait and see.

  2. Richard Boyce

    Error detection

    We've long had ECC RAM available, but only really critical tasks have had CPU redundancy for detecting and removing errors. Maybe it's time for that to change. As chips have more and more cores added, perhaps we could usefully use an option to tie cores together in threes to do the same tasks with majority voting to determine the output.

    1. AndrewV

      Re: Error detection

      It would be more efficient to only call in the third as a tiebreaker.
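
      Sketched out, the pair-plus-tiebreaker variant might look something like this (compute_on() is a hypothetical stand-in for running the work pinned to a given core):

      def compute_on(core_id, x):
          # hypothetical: imagine this runs the work pinned to a particular core
          return (x * x + 1) % 97

      def checked_compute(x):
          a = compute_on(0, x)
          b = compute_on(1, x)
          if a == b:
              return a              # the usual case: no need to wake the arbiter
          c = compute_on(2, x)      # call in the third core only on a mismatch
          if c in (a, b):
              return c              # side with whichever core the arbiter matches
          raise RuntimeError("all three disagree - no quorum")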

      1. yetanotheraoc Silver badge

        Re: Error detection

        As in, appeal to the supreme core?

        1. Ben Bonsall

          Re: Error detection

          As in minority report...

          1. juice

            Re: Error detection

            >As in minority report...

            I think that "voting" concept has been used in a few places - including, if memory serves, the three "Magi" in Neon Genesis Evangelion.

            https://wiki.evageeks.org/Magi

            There's even a relatively obscure story about a Bolo (giant sentient tanks), in which the AI's multi-core hardware is failing, and it has to bring a human along for the ride while fighting aliens, since there's a risk that it'll end up stuck with an even number of "votes" and will need to ask the human to act as a tie-breaker...

            1. J. Cook Silver badge

              Re: Error detection

              ... rumor has it that the Space Shuttle's early avionics and main computer were something akin to a 5-node cluster working on the same calculations, and that the results from at least three of the nodes had to be identical.

              Or something like that.

              1. bombastic bob Silver badge
                Devil

                Re: Error detection

                without revealing [classified information] the concept of "2 out of 3" needed to initiate something, such as [classified information], might even use an analog means of doing so, and pre-dates the space shuttle [and Evangelion] by more than just a few years.

                Definitely a good idea for critical calculations, though.

                1. martinusher Silver badge

                  Re: Error detection

                  Two out of three redundancy is as old as the hills. It can be made a bit more reliable by having different systems arrive at the result -- instead of three (or more) identical boxes you distribute the work among different systems so that the likelihood of an error showing up in more than one system is minimized.

                  The problem with this sort of approach is not just bulk but time -- like any deliberative process you have to achieve a consensus to do anything which inevitably delays the outcome.

              2. General Purpose

                Re: Error detection

                Something like this?

                During time-critical mission phases (i.e., recovery time less than one second), such as boost, reentry, and landing, four of these computers operate as a redundant set, receiving the same input data, performing the same flight-critical computations, and transmitting the same output commands. (The fifth computer performs non-critical computations.) In this mode of operation, comparison of output commands and “voting” on the results in the redundant set provide the basis for efficient detection and identification of two flight-critical computer failures. After two failures, the remaining two computers in the set use comparison and self-test techniques to provide tolerance of a third fault.

      2. YARR

        Re: Error detection

        The problem with a 3rd tiebreaker is that there’s about a 1/n^2 probability of that also being ‘mercurial’ and favouring the corrupt core over the working one.

        1. FeepingCreature Bronze badge

          Re: Error detection

          Well, only if they fail in the same way. Which given there is only one correct answer but a near unlimited number of wrong answers, seems quite unlikely.

          1. EnviableOne

            Re: Error detection

            when the answer is only 0 or 1, there are numerous ways the answer can end up wrong or invalid

            even down to cosmic rays (there are open bugs in cisco equipment with background radiation and em interference as known causes)

    2. cyberdemon Silver badge
      Devil

      Re: Error detection

      Nah, they'll just hide the errors under layer upon inscrutable layer of neural network, and a few arithmetic glitches will probably benefit the model as a whole.

      So instead of being a function of its input and training data, and coming to a conclusion like "black person == criminal" it will say something like "bork bork bork, today's unperson of the day is.. Richard Buttleoyce"

      1. SCP

        Re: Error detection

        Perhaps it should be a

        +++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++

        error.

    3. Natalie Gritpants Jr

      Re: Error detection

      You can configure some ARM cores this way and crash on disagreement. Doubles your power consumption and chip area though.

      1. John Robson Silver badge

        Re: Error detection

        Depends on what's happening - it sounds like this is an issue with specific cpus/cores occasionally.

        In which case you only need to halve your processor speed periodically throughout life to pick up any discrepancies.

        1. bombastic bob Silver badge
          Devil

          Re: Error detection

          I dunno about half speed... but certainly limit the operating temperature.

          more than likely it's caused by running at higher than average temperatures (that are still below the limit) which cause an increase in hole/electron migration within the gates [from entropy] and they become weakened and occasionally malfunction...

          (at higher temperatures, entropy is higher, and therefore migration as well)

          I'm guessing that these malfunctioning devices had been run at very high temperatures, almost continuously, for a long period of time [years even]. Even though the chip spec allows temperatures to be WAY hotter than they usually run at, it's probably not a good idea to LET this happen in order to save money on cooling systems (or for any other reason related to this).

          On several occasions I've seen overheated devices malfunction [requiring replacement]. In some cases it was due to bad manufacturing practices (an entire run of bad boards with dead CPUs). I would expect that repeated exposure to maximum temperatures over a long period of time would eventually have the same effect.

    4. SCP

      Re: Error detection

      I believe that sort of architecture (multi-core cross comparison) has already been proposed in the ARM triple-core-lockstep (TCLS). This is an extension to classic lockstep and offers correction-on-the-fly. [Not sure where they are on realization.]

    5. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      That is a lot of silicon being dedicated to a problem that can be solved with less.

      It's possible, and it has been done where I used to work (Sussex uni), to implement error checking in logic gates. A researcher there around 20 years ago was generating chip designs for error checking, finding the fewest gates needed and so reducing the design sizes in use at the time.

      He was using a GA (genetic algorithm) back then to produce the needed layouts and found many more efficient ones than were in use (he provided them free to use). This could be applied to CPUs, and is for critical systems, but as it uses more silicon it isn't done in consumer CPUs, since that adds cost for no performance gain.

      1. Anonymous Coward
        Anonymous Coward

        Re: Error detection

        May have been this guy.

        https://users.sussex.ac.uk/~mmg20/index.html

    6. EveryTime

      Re: Error detection

      CPU redundancy has been around almost since the beginning of electronic computing, but it largely disappeared in the early 1990s as caching and asynchronous interrupts made cycle-by-cycle comparison infeasible.

      My expectation is that this will turn out to be another in a long history of misunderstanding faults. It's seeing a specific design error and mistaking it for a general technology limit.

      My first encounter with this was when dynamic RAM was suffering from high fault rates. I read many stories on how the limit of feature size had been reached. The older generation had been reliable, so the speculation was that the new, smaller memory capacitors had crossed the threshold where every cosmic ray would flip bits. I completely believed those stories. Then the next round of stories reported that the actual problem was the somewhat radioactive ceramic used for the chip packaging. Switching to a different source of ceramic avoided the problem, and it was a motivation to simply change to less expensive plastic packages.

      The same thing happened repeatedly over the years in supercomputing/HPC. Researchers thought that they spotted disturbing trends in the largest installed systems. What they found was always a specific solvable problem, not a general reliability limit to scaling.

    7. Warm Braw

      Re: Error detection

      The approach adopted by Tandem Computers was to duplicate everything - including memory and persistent storage -- as you can get bus glitches, cache glitches and all sorts of other transient faults in "shared" components which you would not otherwise be able to detect simply from core coupling. But even that doesn't necessarily protect against systematic errors where every instance of (say) the processor makes the same mistake repeatably.

      It's a difficult problem: and don't forget that many peripherals will also have processors in them, it's not just the main CPU you have to look out for.

    8. Anonymous Coward
      Anonymous Coward

      Re: Error detection

      So maybe it's all an Intel plot to sell three times as much hardware? ...

    9. bombastic bob Silver badge
      Devil

      Re: Error detection

      CPU redundancy may be easier than people may want to admit...

      If your CPU has multiple (actual) cores, for "critical" operations you could run two parallel threads. If your threads can be assigned "CPU affinity" such that they don't hop from CPU to CPU as tasks switch around then you can compare the results to make sure they match. If you're REALLY paranoid, you can use more than 2 threads.

      If it's a VM then the hypervisor (or emulator, or whatever) would need to be able to ensure that core to thread affinity is supported.
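
      A rough, Linux-only sketch of that idea - critical_op() is a made-up workload, os.sched_setaffinity pins each worker process to one core, and the results are compared afterwards; a sketch of the concept, not anyone's production setup:

      import os
      from multiprocessing import Process, Queue

      def critical_op(x):
          # made-up stand-in for the "critical" calculation
          return sum(i * i for i in range(x)) % (2**61 - 1)

      def worker(core, x, out):
          os.sched_setaffinity(0, {core})   # pin this worker to one core (Linux)
          out.put((core, critical_op(x)))

      def run_redundant(x, cores=(0, 1)):
          out = Queue()
          procs = [Process(target=worker, args=(c, x, out)) for c in cores]
          for p in procs:
              p.start()
          results = dict(out.get() for _ in cores)
          for p in procs:
              p.join()
          if len(set(results.values())) != 1:
              raise RuntimeError(f"core disagreement: {results}")
          return next(iter(results.values()))

      if __name__ == "__main__":
          print(run_redundant(1_000_000))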

    10. Anonymous Coward
      Anonymous Coward

      Re: Error detection and elimination

      "only really critical tasks have had CPU redundancy for detecting and removing errors. "

      Tandem Nonstop mean anything to you?

      Feed multiple nominally identical computer systems the same set of inputs and if they don't have the same outputs something's gone wrong (massively oversimplified).

      Lockstep at IO level rather than instruction level (how does instruction-level lockstep deal with things like soft errors in cache memory, which can be corrected but are unlikely to occur simultaneously on two or more systems being compared?).

      Anyway, it's mostly been done before. Just not by the Intel/Windows world.

    11. The Oncoming Scorn Silver badge

      Re: Error detection

      Where's my minority report!

    12. Tom 7

      Re: Error detection

      Could cause more problems than it solves. If all three cores are close to each other on the die and the error is one of the 'field type' (where lots of certain activity in a certain area of the chip causes the problem) then all three cores could fall foul of the same problem and provide identical incorrect results, thus giving the illusion all is ok.

  3. Fruit and Nutcase Silver badge
    Joke

    The Spanish Inquisition

    "we must extract 'confessions' via further testing"

    They were good at extracting confessions. Maybe the Google boffins can learn a few techniques from them

    1. Neil Barnes Silver badge

      Re: The Spanish Inquisition

      Fear and surprise are our weapons, and, er, being mercurial...

      1. Paul Crawford Silver badge

        Re: The Spanish Inquisition

        Our three methods are fear, surprise, being mercurial. Oh and an almost fanatical devotion to IEEE 754 Standard for Floating-Point Arithmetic!

        Damn! Among our methods are fear...

        1. Alan Brown Silver badge

          Re: The Spanish Inquisition

          as long as you don't round over multiple iterations (long story behind this comment....)

          1. Ken Moorhouse Silver badge

            Re: (long story behind this comment....)

            You can tell us... over multiple iterations, if you like.

            1. Claptrap314 Silver badge

              Re: (long story behind this comment....)

              (Who started out in floating point validation)

              slowly shakes head...

        2. jmch Silver badge

          Re: The Spanish Inquisition

          We'll come in again

    2. Anonymous South African Coward Bronze badge

      Re: The Spanish Inquisition

      Aren't a Mother Confessor and a team of Mord-Siths more practical for getting out confessions?

    3. Irony Deficient

      Maybe the Google boffins can learn a few techniques from them.

      Ximénez: Now, old woman — you are accused of heresy on three counts: heresy by thought, heresy by word, heresy by deed, and heresy by action — four counts. Do you confess?

      Wilde: I don’t understand what I’m accused of.

      Ximénez: Ha! Then we’ll make you understand! Biggles! Fetch … the cushions!

  4. fredesmite2

    fault tolerance

    There is little fault tolerance and detection anymore... silent data corruption is more the norm in Intel mass-produced servers.

  5. Joe W Silver badge
    Coat

    auto-erratic ransomware

    Oh.

    that's erratic

    (sorry....)

    1. Arthur the cat Silver badge

      Re: auto-erratic ransomware

      Just wait a while. All technology gets used for porn.

      1. J. Cook Silver badge
        Paris Hilton

        Re: auto-erratic ransomware

        I think y'all meant auto-erotic.

        I'll be in my bunk.

  6. GloriousVictoryForThePeople

    > "The other half is a mix of false accusations and limited reproducibility."

    Perfect for AI facial recognition workloads in Apple stores then.

  7. Moldskred

    "The mega-corp is currently relying on human-driven core integrity interrogation, [...]"

    That sentence sounds a lot more dystopian than it actually is -- which is a nice change of pace when talking about tech companies.

  8. Ken Moorhouse Silver badge

    Complexity: Another nail in the coffin...

    ...for the cloud.

    Before anyone posts the obvious rebuttal: note this phrase "two of the world's larger CPU stressors, Google and Facebook".

    If your critical business processes are on-prem, the chances are that you will not be stressing your CPUs to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are the CPU time crunching it is being shared with other companies' CPU time.

    I grew up with the concept of "the clock pulse". If we're pushing synchronous data to the limits (rise-fall time of data wrt clock pulses) then you could arguably get a skew effect. If designers are in denial about that then there are big problems ahead. (Rowhammer is a related problem).

    1. Brewster's Angle Grinder Silver badge

      Not all electrons are made equal...

      To me this sounds like quantum effects. No manufacturing process produces exact replicas; there is going to be subtle variation between chips. I don't know anything about modern chip design and manufacture so can't speculate what it could be. But electron behaviour is just the law of averages. And so whatever these defects are, it means electrons can periodically jump where they shouldn't. The smaller the currents, the fewer party* electrons are needed for this to become significant.

      * The party number is one of the important quantum numbers. It determines how likely an electron is to be an outlier. It's normally represented as a mullet.

      1. Anonymous Coward
        Anonymous Coward

        Re: Not all electrons are made equal...

        It's not quantum effects. Just process variation. These are factored in when closing the design. No two devices will be the same. There is a spread. Some manufacturers will then use binning to grade parts by performance.

        1. Brewster's Angle Grinder Silver badge

          Forbidden gates

          But these are process variations that are being missed by manufacturers and where the chip generally functions as required. Just every once in a while it goes haywire. You could call it fate. You could call it luck. You could call it Karma. You could say it's mercurial or capricious. Or you could suspect some process variation allows tunnelling with low probability, or that some other odd transition or excitation is happening.

          1. Anonymous Coward
            Anonymous Coward

            Re: Forbidden gates

            No it doesn't really work like that.

            But there could be IR drop, crosstalk or local heating issues. All of these should be analysed during chip implementation and verification, though.

          2. Anonymous Coward
            Boffin

            Re: Forbidden gates

            It's just down to the statistics of very rare events with very large N. If you have a reliable processor with a clock speed of 10^9 hertz that gives you just one error every 10^20 clocks, then you can expect an error every 3000 years or so, say a one in five hundred or a thousand chance of seeing a single error during the 3-6 year life of the system. I can live with that for my laptop.

            But if you buy a million of those processors and run them in parallel in data centres then you will see roughly an error every day.
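
            Back-of-envelope check of those (assumed) figures:

            clock_hz  = 1e9        # 10^9 cycles per second
            per_clock = 1e-20      # one error per 10^20 clocks
            fleet     = 1_000_000  # processors in the data centre

            year = 3600 * 24 * 365.25
            mtbf_one = 1 / (clock_hz * per_clock)     # seconds between errors, one CPU
            print(mtbf_one / year)                    # ~3170 years for a single machine
            print(mtbf_one / fleet / 3600)            # ~28 hours: about one a day fleet-wide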

            1. General Purpose

              Re: Forbidden gates

              The trouble is that those errors aren't evenly spread. Specific individual cores go bad. The chances are against you having one of those in your laptop or one of your on-premises servers, but if you do have one then you may experience a series of mysterious crashes, incorrect calculations and/or data loss, not just one incident.

      2. Roland6 Silver badge

        Re: Not all electrons are made equal...

        >To me this sounds like quantum effects.

        And to me too.

        Also shouldn't rule out cosmic radiation and other particles of interest that normally pass through stuff, yet given a sufficiently large sample will hit something...

    2. Cuddles

      Re: Complexity: Another nail in the coffin...

      "If your critical business processes are on-prem, the chances are that you will not be stressing your CPU's to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time."

      I don't think it's anything to do with CPU time, but simply the number of CPUs. As the article notes, it's a few problematic cores per several thousand CPUs, ie. it's not random failures due to the large amount of use, it's some specific cores that have a problem. But since the problems are rare, only people operating many thousands of them are likely to actually encounter them. So it's a bit misleading to call them "stressors" of CPUs; it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs.

      So it's hard to say if on-prem would be better or not. On the one hand, you're unlikely to have enough CPUs to actually have a problem. But if you get unlucky and you do, the problematic core will be a greater percentage of your computing, and you're unlikely to be able to actually spot it at all. On the other hand, being assigned different CPUs every time you run a task in the cloud makes it almost inevitable that you'll encounter a troublesome core at some point. But it's unlikely to be a persistent problem since you won't have the same core next time, and the companies operating at that scale are able to assign the resources to actually find the problem.

      1. AndrewB57

        Re: Complexity: Another nail in the coffin...

        I **think** that means that there is no difference in the chance of corruption as experienced at a local level.

        Then again, 15% of statisticians will ALWAYS disagree with the other 90%

        1. Will Godfrey Silver badge
          Happy

          Re: Complexity: Another nail in the coffin...

          I see what you did there!

      2. John Brown (no body) Silver badge
        Devil

        Re: Complexity: Another nail in the coffin...

        "happen to use a lot of CPUs"

        What about GPUs, I wonder? Should the big crypto miners be getting concerned about now?

        1. Jon 37

          Re: Complexity: Another nail in the coffin...

          No, because of the way crypto is designed. Any miner who tries to submit a mined block will have it tested by every other node on the network. If the miner's system glitched, then the block just won't be accepted. And this sounds rare enough that a miner would just shrug and move on to the next block.

      3. stiine Silver badge

        re: Cuddles: Re: Complexity: Another nail in the coffin...

        But that doesn't explain why they had crypto algorithms that would only decrypt on the same CPU and core that they were originally encrypted on.

        1. Jon 37

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          "Stream ciphers", one of the common kinds of encryption algorithm, work by taking a key and generating a long string of pseudo-random numbers from that key. That then gets XOR'd into the data.

          It's the same algorithm to encrypt and to decrypt. (Like how ROT13 is the same algorithm to encrypt and to decrypt, except a lot more secure).

          So it's certainly possible that a core bug results in the specific sequence of instructions in the pseudo-random-number generator part giving the wrong answer. And it's certainly possible that is reproducible, repeating it with the same key gives the same wrong answer each time.

          That would lead to the described behaviour - encrypting on the buggy core gives a different encryption from any other core, so only the buggy core can decrypt it.
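
          A toy illustration of the point - the keystream generator below is a stand-in rather than a real cipher, and the "core fault" is modelled simply as a reproducible wrong step inside that generator:

          import hashlib
          from itertools import count

          def keystream(key: bytes, n: int, buggy: bool = False) -> bytes:
              out = b""
              for block in count():
                  chunk = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
                  if buggy:
                      # model a reproducible core fault: one step always comes out wrong
                      chunk = bytes(b ^ 0x04 for b in chunk)
                  out += chunk
                  if len(out) >= n:
                      return out[:n]

          def xor_cipher(key, data, buggy=False):
              # encrypt and decrypt are literally the same XOR operation
              return bytes(d ^ k for d, k in zip(data, keystream(key, len(data), buggy)))

          msg = b"attack at dawn"
          ct = xor_cipher(b"key", msg, buggy=True)     # encrypted on the faulty core
          print(xor_cipher(b"key", ct, buggy=True))    # the faulty core decrypts it fine
          print(xor_cipher(b"key", ct, buggy=False))   # a healthy core gets garbage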

        2. bombastic bob Silver badge
          Devil

          Re: re: Cuddles: Complexity: Another nail in the coffin...

          maybe they need to use an encryption algorithm that isn't susceptible to (virtually) identical math errors during encryption and decryption. Then you could self-check by decrypting the encrypted output and comparing to the original. So long as the errors produce un-decryptable results, you should be fine.

      4. Michael Wojcik Silver badge

        Re: Complexity: Another nail in the coffin...

        it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs

        Well, it's also about how much of the time a given CPU (or rather each of its cores) is being used, since that's what gives you a result that might be incorrect. If a company "uses" a million cores but a given core is idle 90% of the time, they'll be much less likely to encounter a fault, obviously.

        So while "stressing" is probably not really an accurate term – it's not like they're using the CPUs outside their documented envelope (AFAIK) – "using more or less constantly" is a relevant qualification.

  9. Smudged
    Terminator

    Evolution of the microchip

    So Google have witnessed chips evolving to produce their own ransomware capability. What next? How far from Skynet are we?

  10. amanfromMars 1 Silver badge

    Just the cost of doing such business.

    The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests.

    If you consider chips as not too dissimilar from the networking of smarter humans, emerging anomalies are much easier to understand and be prepared for and accepted as just being an inherent endemic glitch always testing novel processes and processing there is no prior programming for.

    And what if they are not simply errors but other possibilities available in other realities/times/spaces/virtually augmented places?

    Are we struggling to make machines more like humans when we should be making humans more like machines….. IntelAIgent and CyberIntelAIgent Virtualised Machines?

    Prime Digitization offers Realisable Benefits.

    What is a computer other than a machine which we try to make think like us and/or for us? And what other model, to mimic/mirror could we possibly use, other than our own brain or something else SMARTR imagined?

    And if through Deeper Thought, our Brain makes a Quantum Leap into another Human Understanding such as delivers Enlightened Views, does that mean that we can be and/or are Quantum Computers?

    And is that likely to be a Feared and/or AWEsome Alien Territory?

    1. OJay

      Re: Just the cost of doing such business.

      for once, I was able to follow this train of thought until the end.

      So, what does that make me. Mobile Autonomous quaNtum unit?

      1. amanfromMars 1 Silver badge

        Re: Just the cost of doing such business.

        for once, I was able to follow this train of thought until the end.

        So, what does that make me. Mobile Autonomous quaNtum unit? ..... OJay>

        Gifted is a viable and pleasant thought, OJay, and would not be at all presumptuous. :-)

  11. Anonymous Coward
    Coat

    That's mercurial as in unpredictable, not Mercurial as in the version control system of the same name.

    So what you're really saying is the version control system of the same name was aptly named.

  12. Anonymous Coward
    Anonymous Coward

    Once upon a time.....way back in another century......

    ......some of us (dimly) remember the idea of a standard development process:

    1. Requirements (how quaint!!!)

    2. Development

    3. Unit Test

    4. Functional Test

    5. Volume Test (also rather quaint!!)

    6. User Acceptance Test (you know...against item#1)

    .....where #4, #5 and #6 might overlap somewhat in the timeline.

    Another old fashioned idea was to have two (or three) separate installations (DEV, USER, PROD).......

    ......not sure how any of this old fashioned, twentieth century thinking fits in with "agile", "devops", "cloud"....and other "advanced" twenty first century thinking.

    ......but this article certainly makes this AC quite nostalgic for days past!

    1. Ken Moorhouse Silver badge

      Re: 5. Volume Test (also rather quaint!!)

      Every gig I've ever attended they never ever got past 2.

      1. Arthur the cat Silver badge

        Re: 5. Volume Test (also rather quaint!!)

        Every gig I've ever attended they never ever got past 2.

        <voice accent="yorkshire">You were lucky!</voice>

    2. Anonymous South African Coward Bronze badge

      Re: Once upon a time.....way back in another century......

      In the Elder Days, when things was Less Rushed, sure, you could take your time with a product, and deliver a product that lived up to its promises.

      Nowadays in these Younger Days everything is rushed to market (RTM) after a vigorous spit 'n polish and sugarcoating session to hide most of Them Nasteh Buggreh Bugs. And nary a peep of said TNBBs either... hoping said TNBBs won't manifest themselves until closer to the End Lifetime of the Product.

      Case in point - MCAS.

      ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

      1. Version 1.0 Silver badge
        Thumb Up

        Re: Once upon a time.....way back in another century......

        I never saw any problems with an 8080, 8085, 8048, or Z80 that I didn't create myself and fix as soon as I saw the problem. Processors used to be completely reliable until the marketing and sales departments started wanting to add "features", which has led to all of today's issues.

      2. Arthur the cat Silver badge

        Re: Once upon a time.....way back in another century......

        ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.

        On the other hand, back when, a friend of mine remarked that the Sinclair Scientific calculator was remarkably egalitarian, because if you didn't like the answer it gave you, you just had to squeeze the sides and it would give you a different one.

      3. SCP

        Re: Once upon a time.....way back in another century......

        ASAC wrote:

        "ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPU's still produces vaild results and nary a corruption."

        Is that with or without the pint of milk on top to keep it cool enough?

    3. Primus Secundus Tertius

      Re: Once upon a time.....way back in another century......

      AC has described the ideal case.

      In practice, there were repeats of item 1 between items 2 and 3, 3 and 4, etc. Table-thumping customer managements and toadying contractor sales people.

      (S)He also omits a necessary step between 1 and 2, namely the software design. The requirements stated what was thought to be required - not always a correct piece of analysis. The software design says how you get there in terms of data structures and algorithms. Once software got past transcribing maths into FORTRAN the SD was essential.

      For CPUs, replace software with microcode. This was even more problematical than orthodox code.

    4. Doctor Syntax Silver badge

      Re: Once upon a time.....way back in another century......

      7. Use in production.

      It's only in 7, and even then only at large scale that rare, sporadic failures become recognisable. Even if you were lucky enough to catch one at the previous stages you wouldn't be able to reproduce it reliably enough to understand it.

  13. Pascal Monett Silver badge

    "misbehaving cores"

    Is the solution of replacing the CPU with another, identical one not a good idea, or will the new one start misbehaving in the same way?

    The article states that Google and Facebook report a few cores in a thousand. That means that most CPUs are functioning just fine, so rip out the mercurial CPUs and replace them. That should give a chance of solving the immediate issue.

    Of course, then you take the misbehaving CPU and give it a good spanking, euh, put it in a test rig to find out just how it fails.

    1. Blank Reg

      Re: "misbehaving cores"

      Replacing CPUs is the easy part. Detecting that a CPU needs replacing because it makes a mistake once every ten thousand hours is the hard part.

      1. Neil Barnes Silver badge

        Re: "misbehaving cores"

        Paraphrasing the jackpot computer in Robert Sheckley's Dimension of Miracles: I'm allowed to make one mistake in ten million and therefore not only am I going to, but I have to.

    2. Ken Moorhouse Silver badge

      Re: "misbehaving cores"

      The question is whether this is at the CPU level, the board level, the box level or the system level. Tolerances* for all of these things give rise to unacceptable possibilities - don't forget at the board/box level you've got power supplies and, hopefully, UPSs attached to those. How highly do these data centres/centers rate these seemingly mundane sub-assemblies, for example? (I'm sure many of us here have had experiences with slightly wayward PSUs).

      *The old-fashioned "limits and fits" is to my mind a better illustration of how components work with each other.

  14. Red Ted
    Happy

    SETI saw result corruption too

    The SETI project used to see work units with corrupted results and they double checked all results.

    They attributed it to cosmic rays striking the micro and causing a bit flip.

    1. Anonymous Coward
      Alien

      They attributed it to cosmic rays striking the micro and causing a bit flip.

      It was just aliens hiding their presence. But they did it on the checks too.

    2. Anonymous Coward
      Anonymous Coward

      Re: SETI saw result corruption too

      IIRC SETI also noticed that a lot of the corrupted results came from CPUs that had been overclocked.

    3. Wokstation

      Neutrinos!

      They occasionally bump stuff and can flip a bit - we're building more and more surface area of microchip, so it's only natural that neutrino hits would be proportionally more common.

  15. Brewster's Angle Grinder Silver badge

    Poacher turned gamekeeper

    I think they should hire that ransomware core as a cryptographer.

  16. Binraider Silver badge

    Can we have ECC RAM supported by regular chipsets, please. Like we certainly had in the late 90's / early 2000's off the shelf. The sheer quantity of RAM and reduced tolerance to radiation means the probability of bitflips is rather greater today than before.

    Either AMD or Intel could put support back into consumer chipsets as an easy way to get an edge over competitors.

    Regarding CPUs, there's a reason satellite manufacturers are happy using a 20-year-old architecture and manufacturing process at 200nm. Lower vulnerability to radiation-induced errors. (And using SRAM rather than DRAM too, for the same reason). Performance, cost, "tolerable" error. Rather less practical to roll back consumer performance (unless you fancy getting some genuinely efficient software out in circulation).

    1. Claptrap314 Silver badge

      I'm sure that F & G are using ECC RAM already. It's always been out there, but the marginal cost has been enough that (usually) retail consumers avoid it. But I recall it from the '80s.

      1. Roland6 Silver badge

        >retail consumers avoid it.

        Err I think you'll find it is the manufacturers who avoid the use of ECC supporting chipsets in consumer products.

        1. Claptrap314 Silver badge

          Retail consumers avoid the cost that manufacturers charge for it. So...sure.

    2. osmarks

      AMD does, I believe, but doesn't officially label it as supported.

  17. Wolfclaw

    Pointless having a third core deciding on the tie breaker, the losing cores will simply call in the lawyers to overthrow the result and demand a recount.

  18. dinsdale54

    I have worked for a few hardware companies over the years and every single one has at some point had issues with random errors causing system crashes at above designed rates - these were all bit-flip errors.

    In each case the people who noticed first were our biggest customers. In one of these cases the way they discovered the problem was products from two different companies exhibiting random errors. A quick look at both motherboards showed the same I/O chipset in use. Radioactive contamination in the chip packaging was the root cause.

    You can mitigate these by putting multi-layer parity and ECC on every chip, bus and register with end-to-end checksumming. That will turn silent data corruption into non-silent, but it's also really expensive.
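
    A sketch of the end-to-end idea, with CRC32 standing in purely for illustration (a real system would checksum at every layer and use stronger checks):

    import zlib

    def seal(payload: bytes) -> bytes:
        # attach a checksum when the data is produced
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def unseal(blob: bytes) -> bytes:
        # verify it when the data is consumed, however many layers later
        payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
        if zlib.crc32(payload) != crc:
            raise IOError("checksum mismatch - corruption detected, no longer silent")
        return payload

    blob = seal(b"important record")
    blob = blob[:3] + bytes([blob[3] ^ 0x01]) + blob[4:]   # simulate a bit flip in transit
    try:
        unseal(blob)
    except IOError as e:
        print(e)                                           # detected rather than silent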

    But at least let's have ECC as standard!

    1. autopoiesis

      Radioactive contamination in the chip packaging - that's intriguing. I take it that wasn't determined during the 'quick look' phase ;)

      Nice find in any case - what was the contaminant, who found it, how long did it take etc?

      1. dinsdale54

        I forget the exact details - this was over 10 years ago - but IIRC systems that had generated these errors were put in a radiation test chamber and the radioactivity measured. Once you have demonstrated there's a problem, it's down to the chipset manufacturer to find the issue. I think it was just low-level contamination in the packaging material that occasionally popped out an alpha particle and could flip a bit.

        The remediation is a massive PITA. I think we were dealing with it for about 2 years from initial high failure rates to having all the faulty systems replaced.

        Over the years I have spent far more of my career dealing with these issues than I would like. I put in a big shift remediating Seagate MOOSE drives that had silent data corruption as well.

  19. Ilsa Loving

    Minority report architecture

    The only way to be sure would be to have at least 2 cores doing the same calculation each time. If they disagreed, run the calculation again. Alternatively you could have 3 cores doing the same calculation and if there's one core wrong then majority wins.

    Or we finally move to a completely new technology like maybe optical chips.

  20. steelpillow Silver badge
    Happy

    Collecting rarities

    "Rarely seen [things] crop up frequently"

    I love that so much, I don't want it to be sanity-ised.

    1. David 132 Silver badge

      Re: Collecting rarities

      AKA “Million to one chances crop up nine times out of ten”, as Sir Pterry put it…

  21. Teejay

    Brazil, here we come...

    Tuttle Buttle coming nearer.

    1. bazza Silver badge

      Re: Brazil, here we come...

      How are your ducts?!

  22. bsimon

    cosmic rays

    Well, I used to blame "cosmic rays" for bugs in my code, now I have a new excuse: "mercurial cores" ...

  23. Anonymous Coward
    Anonymous Coward

    Sheer volume of data might also be part of the issue. I'm reminded of a comment from a Reg reader who worked somewhere that processed massive data sets. He said something like "million-to-one shots happen every night for us".

  24. Persona

    Networks can misbehave too

    I've seen it happen to network traffic too. With data being sent on a network going halfway around the world, occasionally a few bits got changed. Deep analysis showed that interference was hitting part of the route and that the network error detection was doing its job, detecting it and getting it resent. Very, very, very rarely the interference corrupted enough bits to pass the network-level error check.

    1. Anonymous Coward
      Anonymous Coward

      Re: Networks can misbehave too

      One remote comms device was really struggling - and the transmission errors generated lots of new crashes in the network controller. The reason was the customer had the comms cable running across the floor of their arc welding workshop.

      A good test of a comms link was to wrap the cable a few times round a hair dryer - then switch it on. No matter how good the CRC - there is always a probability of a particular set of corrupt data passing it.

  25. Claptrap314 Silver badge

    Been there, paid to do that

    I did microprocessor validation at AMD & IBM for a decade about two decades ago.

    I'm too busy to dig into these papers, but allow me to lay out what this sounds like. AMD never had problems of this sort while I was there (the validation team Dave Bass built was that good--by necessity.)

    Many large customers of both companies found it worthwhile to have their own validation teams. Apple in particular had a validation team that was frankly capable of finding more bugs than the IBM team did in the 750 era. (AMD's customers in the 486 & K5 era would tell them about the bugs that they found in Intel's parts & demand that we match them.)

    Hard bugs are the ones that don't always happen--you can execute the same stream of instructions multiple times & get different results. This is almost certainly not the case for the "ransomware" bug. This rules out a lot of potential issues, including "cosmic rays" and "the Earth's magnetic field". (No BOFHs.)

    The next big question is whether these parts behave like this from the time that they were manufactured, or if they are the result of damage that accumulates during the lifetime of any microprocessor. Variations in the manufacturing process can create either of these. We run tests before the dies are cut to catch the first case. For the latter, we do burn-in tests.

    My first project at IBM was to devise a manufacturing test to catch a bug reported by Nintendo in about 3/1000 parts (AIR) during the 750 era. They wanted to find 75% of the bad parts. I took a bit longer than they wanted to isolate the bug, but my test came out 100% effective.

    My point is that this has always been an issue. Manufacturers exist to make money. Burn-in tests are expensive to create--and even more expensive to run. You can work with your manufacturer about these issues or you can embarrass them. Sounds like F & G are going for the latter.

    Oh, and I'm available. ;)

  26. pwjone1

    Error Checking and modern processor design

    To some degree, undetected errors are to be more expected as chip lithography evolves (14nm to 10nm to 7nm to 6 and 5 or 3nm). There is a history of dynamic errors (crosstalk, XI, and other causes), and the susceptibility to these gets worse as the device geometries get smaller -- just fewer electrons need to leak. Localized heating also becomes more of a problem the denser you get. Obviously Intel has struggled to get to 10nm, potentially also a factor.

    But generally x86 (and Atom) processor designs have not had much error checking; the design point is that the cores "just work", and as that has gradually become more and more problematical, it may be that Intel/AMD/Apple/etc. will need to revisit their approach. IBM, on higher-end servers (z), generally includes error checking. This is done via various techniques like parity/ECC on internal data paths and caches, predictive error checking (for example, on a state machine, you predict/check the parity or check bits of the next state), and in some cases full redundancy (like laying down two copies of the ALU or cores and comparing the results each cycle). To be optimal, you also need some level of software recovery, and there are varying techniques there, too.

    Added error-checking hardware also has its costs, generally 10-15% and a bit of cycle time, depending on how much you put in and how exactly it is implemented. So in a way, it is not too much of a surprise that Google (and others) have observed "mercurial" results; any hardware designer would have shrugged and said "What did you expect? You get what you pay for."
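
    A rough software model of the parity-prediction idea described above - the transition table and the injected fault are invented, and in real hardware the predictor would be separate logic derived from the same spec rather than a recomputation:

    NEXT = {  # (state, input) -> next state; a toy 2-bit state machine
        (0, 0): 1, (0, 1): 2,
        (1, 0): 3, (1, 1): 0,
        (2, 0): 0, (2, 1): 3,
        (3, 0): 2, (3, 1): 1,
    }

    def parity(x: int) -> int:
        return bin(x).count("1") & 1

    def predict_parity(state: int, inp: int) -> int:
        # the predicted check bit for the *next* state
        return parity(NEXT[(state, inp)])

    def step(state: int, inp: int, inject_fault: bool = False) -> int:
        nxt = NEXT[(state, inp)]
        if inject_fault:
            nxt ^= 1                          # a flipped bit in the state register
        if parity(nxt) != predict_parity(state, inp):
            raise RuntimeError(f"parity check failed leaving state {state}")
        return nxt

    s = 0
    for i in (0, 1, 1, 0):
        s = step(s, i)                        # normal operation passes the check
    try:
        step(s, 0, inject_fault=True)         # a single-bit fault is caught
    except RuntimeError as e:
        print(e)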

  27. Draco
    Big Brother

    What sort of bloody dystopian Orwellian-tak is this?

    "Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data," said Peter Hochschild.

    How were the machines disciplined? Were they given a warning? Did the machines take and pass the requisite unconscious bias training courses?

    "These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them."

    Did the machines have counsel? Were these accusations proven? Can the machines sue for slander and libel if the accusations are shown to be false?

    -----------

    English is my second language and those statements are truly mind numbing to read. In a certain context, they might be seen as humorous.

    What is wrong with the statements being written something more like:

    "Our adventure began as vigilant production teams increasingly observed machines corrupting data," said Peter Hochschild.

    "These machines were repeatedly observed corrupting multiple different stable well-debugged large-scale applications. Multiple independent teams noted corruptions by these machines even though conventional diagnostics found nothing wrong with them."

    1. amanfromMars 1 Silver badge

      Re: What sort of bloody dystopian Orwellian-tak is this?

      Nice one, Draco. Have a worthy upvote for your contribution to the El Reg Think Tank.

      Transfer that bloody dystopian Orwellian-tak to the many fascist and nationalistic geo-political spheres which mass multi media and terrifying statehoods are responsible for presenting, ...... and denying routinely they be held accountable for ....... and one of the solutions for machines to remedy the situation is to replace and destroy/change and recycle/blitz and burn existing prime established drivers /seditious instruction sets.

      Whether that answer would finally deliver a contribution for retribution and recalibration in any solution to the question that corrupts and perverts and subverts the metadata is certainly worth exploring and exploiting any time the issue veers towards destructively problematical and systemically paralysing and petrifying.

      Some things are just turned plain bad and need to be immediately replaced, old decrepit and exhausted tired for brand spanking new and tested in new spheres of engagement for increased performance and reassuring reliability/guaranteed stability.

      Out with the old and In with the new for a new dawn and welcoming beginning.

  28. Michael Wojcik Silver badge

    ObIT

    That's mercurial as in unpredictable, not Mercurial as in delay lines.

  29. Nightkiller

    Even CPUs are sensitive to Critical Race Theory. At some point you have to expect them to share their lived experience.

  30. Claptrap314 Silver badge

    Buggy processors--that work!

    The University of Michigan made a report (with the above title) around 1998 regarding research that they had done with a self-checking microprocessor. Their design was to have a fully out-of-order processor do the computations and compare each part of the computation against an in-order core that was "led" by the out-of-order design. (The entire in-order core could be formally validated.) When there was a miscompare, the instruction would be re-run through the in-order core without reference to the "leading" of the out-of-order core. In this fashion, bugs in the out-of-order core would be turned into slowdowns.

    In order for the design to function appropriately, the in-order core required a level-0 cache. The result was that overall execution speed actually increased slightly. (AIR, this was because the out-of-order core was permitted to fetch from the L0 cache.)

    The design did not attract much attention at AMD. I assume that was mostly because our performance was so much beyond what our team believed this design could reach.

    Sadly, such a design does nothing to block Spectre-class problems.

    In any event, the final issue is cost. F & G are complaining about how much cost they have to bear.

    1. runt row raggy

      Re: Buggy processors--that work!

      i must be missing something. if checking is required, isn't the timing limited to how fast the slowest (in-order) processor can work?

      1. Claptrap314 Silver badge

        Re: Buggy processors--that work!

        You would think so, wouldn't you?

        The design broke up an instruction into parts--instruction fetch, operand fetch, result computation, result store. (It's been >20 years--I might have this wrong.) The in-order core executed these four stages in parallel. It could do this because of the preliminary work of the out-of-order processor. The out-of-order core might take 15 cycles to do all four steps, but the in-order core does it in one--in no small part due to that L0. The in-order core was being drafted by the out-of-order core to the point that it could manage a higher IPC than the out-of-order core--as long as the data was available, which it often was not, of course.

  31. bazza Silver badge

    Sigh...

    I think there's a few people who need to read up on the works of Claude Shannon and Benoit Mandelbrot...

  32. DS999 Silver badge

    How do they know this is new?

    They only found it after a lot of investigation ruled out other causes. It may have been true in years past but no one had enough CPUs running the same code in the same place for long enough that they could tease out the root cause.

    So linking it to today's advanced processes may point us in the wrong direction, unless we can say for sure this wasn't happening with Pentium Pros and PA-RISC 8000s 25 years ago.

    I assume they have sent some of the suspect CPUs to Intel for them to take an electron microscope to the cores that exhibit the problem, so they can try to determine whether it is some type of manufacturing variation, "wear", or something no one could prepare for - like a one-in-a-septillion neutrino collision with a nucleus changing the electrical characteristics of a single transistor by just enough that an edge-condition error affecting that transistor becomes a one-in-a-quadrillion chance.

    If they did, and Intel figures out why those cores go bad, will it ever become public? Or will Google and Intel treat it as a "competitive advantage" over others?

    1. SCP

      Re: How do they know this is new?

      I can't recall a case related to CPUs, but there were definitely cases like pattern sensitive RAM; resolved when the root cause was identified and design tools modified to avoid the issue.

      The "good old days" were not as perfect as our rose tinted glasses might lead us to recall.

      1. Claptrap314 Silver badge

        Re: How do they know this is new?

        We had a case of a power signal coupling a high bit in an address line leading out of the L1 in the 750. Stopped shipping product to Apple for a bit. Nasty, NASTY bug.

        I don't recall exactly what the source of the manufacturing defect was on the Nintendo bug, but it only affected certain cells in the L2. Once you knew which ones to hit, it was easy to target them. Until I worked it out, though... Uggh.

    2. bazza Silver badge

      Re: How do they know this is new?

      Silicon chips do indeed wear. There's a phenomenon I've heard termed "electron wind" which causes the atoms of the element used to dope the silicon (which is what makes a junction) to be moved across that junction. Eventually they disperse throughout the silicon and then there's no junction at all.

      This is all related to current, temperature and time. More of any of these makes the wearing effect faster.

      Combine the slowness with which that happens, and the effects of noise, temperature and voltage margins on whether a junction is operating as desired, and I reckon you can get effects where apparently odd behaviour can be quasi stable.

      1. Claptrap314 Silver badge
        Happy

        Re: How do they know this is new?

        I deliberately avoided the term "hole migration" because it tends to cause people's heads to explode, but yeah.

        And not just quasi-stable. The effects of hole migration can be VERY predictable. Eventually, the processor becomes inert!

  33. Anonymous Coward
    Anonymous Coward

    what data got corrupted, exactly?

    While the issue of rogue cores is certainly important, since they could possibly do bad things to useful stuff (e.g., healthcare records, first-response systems), I wonder if this will pop up later in a shareholders meeting about why click-throughs (or whatever they use to price advert space) are down. "Um, it was, ahh, data corruption, due to, ahm, processor cores, that's it."

  34. Bitsminer Silver badge

    Reminds of silent disk corruption a few years ago

    Google Peter Kelemen at CERN; he was part of a team that identified a high rate of disk data corruption amongst the thousands of rotating disk drives at CERN. This was back in 2007.

    Among the root causes, the disks were a bit "mercurial" about writing data into the correct sector. Sometimes it got written to the wrong cylinder, track, and block. That kind of corruption, even on a RAID5 set, results in a write-loss of correct data that ultimately can invalidate a complete file system.

    Reasoning is as follows: write a sector to the wrong place (data), and write the matching RAID-5 parity to the right place. Later, read the data back and get a RAID-5 parity error. To fix the error, rewrite the (valid, but old) data back in place because the parity is a mismatch. Meanwhile, the correct data at the wrong place lives on. When that gets read, the parity error is detected and the original (and valid) data is rewritten. The net net of this: loss of the written sector. If this is file system metadata, it can break the file system integrity.

    1. bazza Silver badge

      Re: Reminds of silent disk corruption a few years ago

      Yes I remember that.

      It was for reasons like this that Sun developed the ZFS file system. It has error checking and correction up and down the file system, designed to give error-free operation over exabyte filesystems with very high probability.

      Modern storage devices are close to the point where, if you read the whole device twice, you will not get the same bits back both times - at least one will be wrong.
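
      Rough arithmetic behind that claim, assuming the commonly quoted consumer-drive spec of one unrecoverable bit error per 10^14 bits read:

      uber     = 1e-14            # unrecoverable bit error rate, per bit read
      drive_tb = 14               # capacity of a large modern drive, in TB
      bits     = drive_tb * 1e12 * 8
      print(bits * uber)          # expected bit errors per full read: ~1.1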

  35. captain veg Silver badge

    so...

    ... we've already got quantum computing, and no one noticed?

    -A.

    1. LateAgain

      Re: so...

      We may, or may not, have noticed.

    2. Claptrap314 Silver badge
      Trollface

      Re: so...

      You do know what a transistor is, correct?

  36. Throatwarbler Mangrove Silver badge
    FAIL

    For shame!

    No one has noticed the opportunity for a new entry in the BOFH Excuse Calendar?

    1. Claptrap314 Silver badge

      Re: For shame!

      Just another page for the existing one, my friend.

  37. itzman

    well its obvious

    that as a data "one" gets represented by, e.g., fewer and fewer electrons, statistically the odd "one" will fall below the threshold for being a one and become a zero, and vice versa.

    or a stray cosmic ray will flip a flop.

    or enough electrons will tunnel their way to freedom....

  38. rcxb Silver badge

    Higher datacenter temperatures contributing?

    One has to wonder whether the sauna-like temperatures Google and Facebook are increasingly running their datacenters at are contributing to the increased rate of CPU-core glitches.

    They may be monitoring CPU temperatures to ensure they don't exceed the spec-sheet maximums, but no real-world device has a vertical cliff drop-off, and the more extreme the conditions it operates in, the sooner some kind of failure can be expected. The speedometer in my car goes well into the triple digits, but I wouldn't be shocked if driving it like a race car resulted in mechanical problems rather sooner in its life cycle.

    Similarly, high temperatures are frequently used to simulate years of ageing with various equipment.
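    The usual rule of thumb for that accelerated ageing is an Arrhenius acceleration factor. A minimal sketch, with an assumed activation energy (0.7 eV here is a placeholder, not a figure for any particular failure mechanism):

    import math

    BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

    def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev=0.7):
        # AF = exp((Ea / k) * (1/T_use - 1/T_stress)), temperatures in kelvin.
        t_use = t_use_c + 273.15
        t_stress = t_stress_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

    # Burn-in at 85 C versus normal operation at 45 C:
    af = arrhenius_acceleration(45, 85)
    print(f"One hour at 85 C ages the part like ~{af:.0f} hours at 45 C")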

    1. Claptrap314 Silver badge

      Re: Higher datacenter temperatures contributing?

      I was never privileged to tour our datacenters, but I am HIGHLY confident that G is careful to run the chips in-spec. When you've got that many millions of processors in the barn, a 1% failure rate is E.X.P.E.N.S.I.V.E.

      Now, for decades that spec has been a curve and not a point (i.e. don't run over this speed at this temperature, or that speed at that temperature). This means that "in spec" is a bit broader than the naive approach might guess.

      They also have temperature monitors to trigger a shutdown if the temp spikes too high for too long. They test these monitors.

  39. Anonymous Coward
    Anonymous Coward

    Um. Clockspeed

    Surprised no one has mentioned this yet.

    As a core ages, it will struggle to maintain high clocks and turbo speeds.

    For a core that was marginal when new but passed initial validation, it's not surprising to see it start to behave unpredictably as it gets older. You see it all the time in overclocked systems, but normally the CPU is running an OS, so it'll BSOD on a driver before it starts to make serious mistakes in an application. For a lone core running compute in a server, it's not surprising that it'll start to misstep, and that would align with their findings.

    Identify mercurial cores and drop the clocks by 10%; not rocket science.

    1. Ken Moorhouse Silver badge

      Re: Um. Clockspeed. Surprised no one has mentioned

      I have mentioned both Clock Skew and the interaction of tolerances ("Limits and Fits") between components/sub-assemblies in this topic earlier.

      ===

      The problem with voting systems is that the integrity of the clock system has to be complete. The clock for all elements of the voting system has to be such that there is no chance that the results from one element are received outside of the clock period, to ensure they are included in this "tick's" vote. If included in the next "tick's" vote then not only does it affect the result for this "tick", but the next "tick" too, which is susceptible to a deleterious cascade effect. I'm also assuming it is prudent to have three separate teams of developers, with no shared libraries, for a 2-in-3 voting system, to eliminate common-mode design flaws which might fail to fault on errors.

      If applying voting systems to an asynchronous system, such as TCP/IP messaging (where out-of-band packet responses are integral to the design), how do you set time-outs? If they are set too strictly you get the deleterious snowball effect, bringing down the whole system. Too slack, and you might just as well use legacy technology.
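      To make the tick-alignment point concrete, here's a minimal Python sketch of a 2-in-3 voter that only counts replies which arrived before the tick's deadline; anything late is simply absent, which is exactly where the timeout trade-off bites (the quorum rule and data shapes are assumptions for illustration):

      from collections import Counter

      def vote(results, quorum=2):
          # `results` maps replica id -> value, for replies that arrived
          # before this tick's deadline; late replies are simply missing.
          if not results:
              raise RuntimeError("no replies arrived within the tick")
          tally = Counter(results.values())
          value, count = tally.most_common(1)[0]
          if count < quorum:
              raise RuntimeError(f"no quorum within the tick: {results!r}")
          suspects = [rid for rid, v in results.items() if v != value]
          return value, suspects

      # Replica C disagrees: it is outvoted and flagged for follow-up.
      print(vote({"A": 42, "B": 42, "C": 41}))   # (42, ['C'])

      # Replica C missed the deadline: we still have quorum, but we can
      # no longer tell a slow element from a faulty one.
      print(vote({"A": 42, "B": 42}))            # (42, [])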

  40. Snowy Silver badge
    Coat

    Google and Facebook designed CPUs

    Google and Facebook have started to use CPUs that they designed themselves. Are these the ones that are producing errors?

    1. Claptrap314 Silver badge

      Re: Google and Facebook designed CPUs

      As far as I can tell, those really are more like SOCs and/or GPUs. And, no, they are complaining about someone else's work.

  41. They call me Mr Nick
    FAIL

    Floating Point Fault

    Many years ago while at University I heard of an interesting fault.

    A physicist had run the same program a few days apart but had got different results. And had noticed.

    Upon investigation it transpired that the floating point unit of the ICL mainframe was giving incorrect answers in the 4th decimal place.

    This was duly fixed. But I wondered at the time how many interesting discoveries in physics were actually undiscovered hardware failures.

    1. Ken Moorhouse Silver badge

      Re: Floating Point Fault. had got different results

      That makes it nondeterministic, which is subtly different to giving incorrect (yet consistent) answers in the 4th decimal place.

      Maybe the RAM needed to be flushed each time the physicist's program was run. Perhaps the physicist was not explicitly initialising variables before use.

      1. Doctor Syntax Silver badge

        Re: Floating Point Fault. had got different results

        "Perhaps the physicist was not explicitly initialising variables before use."

        FORTRAN -

        SOMEWARIABLE = 0

        Later on

        SOMEVARIABLE = SOMEVARIABLE + X

        1. Ken Moorhouse Silver badge

          Re: SOMEWARIABLE = 0

          One of the reasons I like Delphi so much is that this cannot easily happen without ignoring compilation warnings, provided you keep global declarations to an absolute minimum.

        2. Binraider Silver badge

          Re: Floating Point Fault. had got different results

          Python, for all its usefulness, has a major bugbear in its dynamic typing and lack of variable declarations - especially when you have a couple of thousand variable names to work with.
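          To illustrate, a minimal sketch of the failure mode (the function is made up for the example): a misspelled name on assignment silently creates a brand-new variable, so the analogue of the FORTRAN slip above sails straight through:

          def running_total(values):
              total = 0
              for x in values:
                  totl = total + x   # typo: silently binds a new name
              return total           # always 0 - no error at runtime

          print(running_total([1, 2, 3]))  # prints 0, not 6

          Static checkers such as pyflakes or pylint will at least flag the unused binding, which is about the closest Python gets to the compile-time warning mentioned above.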

    2. Claptrap314 Silver badge

      Re: Floating Point Fault

      This is why science involves others reproducing your work.

  42. John Savard

    Somebody thought of this before

    Of course, IBM has been designing its mainframe CPUs with built-in error detection all along, and it still does, even now that they're on microchips.

    1. Claptrap314 Silver badge

      Re: Somebody thought of this before

      That's mostly a different issue. But yeah, some of those server parts have more silicon to detect problems (not just errors) than is in the cores.

  43. Paddy
    Pint

    You get what you paid for.

    Doctor, doctor, it hurts if I use this core!

    Then use one of your millions of others?

    But I need all that I buy!

    Then buy safer, ASIL D, automotive spec chips.

    But they cost more!

    So you're cheap? NEXT!

    Chip lifetime bathtub curves are statistical in nature. When you run that many CPUs, their occasional failures might be expected, and failures don't need to be reproducible.
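    A back-of-envelope illustration of "might be expected" (the per-core rate here is a made-up number purely for scale, not anything from the paper):

    # With a tiny per-core fault probability and a huge fleet, the expected
    # number of mercurial cores is comfortably non-zero. Figures are invented.
    fleet_cores = 5_000_000   # assumed fleet size
    p_faulty = 1e-5           # assumed per-core probability of being mercurial
    print(f"Expected mercurial cores: ~{fleet_cores * p_faulty:.0f}")
    # Even a one-in-a-million rate would still leave ~5 such cores here.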

  44. yogidude

    An oldie but a goodie

    I am Pentium of Borg.

    You will be approximated

    1. Claptrap314 Silver badge

      Re: An oldie but a goodie

      The Pentium Bug was a design error. The microcode was programmed with a 0 in a table that was supposed to be something else. That's a completely different issue from this discussion.

      That chip did exactly what it was designed to do. (Some) of these do not.

      1. yogidude

        Re: An oldie but a goodie

        The original bug (the moth found in the Harvard Mark II) was also not part of the design, but gave its name to what we now refer to generically as a flaw in the operation of a device or piece of software, whatever the cause. That said, since posting the above I realised I omitted a line from the original (early 90s) joke.

        I am Pentium of Borg.

        Division is futile.

        You will be approximated.

  45. Piro Silver badge

    CPU lockstep processing

    Somewhere, some old greybeard engineers who developed systems like NonStop are shaking their heads: finally, we've hit a scenario in which having multiple CPUs running in lockstep would solve the issue!

    At least 3, so you can eliminate the "bad" core, generate an alarm, and keep on running.

    But no, everyone thought that specialist systems with all kinds of error checking and redundancy were wild overkill, and that everything could be achieved with lots of commodity-level hardware.

    1. Charles 9

      Re: CPU lockstep processing

      Then we get the scenario when two of them agree but on the wrong answer...

      1. Ken Moorhouse Silver badge

        Re: Then we get the scenario when two of them agree but on the wrong answer...

        Yes. Seen this with a well-known accounts package. One error on the audit trail. Ran the data checker which assumed that the erroneous transaction was correct, and everything else was wrong.

    2. Claptrap314 Silver badge

      Re: CPU lockstep processing

      You ever try designing a board like that? Triple the buses, 10x the fun! (And by fun, I mean electrical interference between buses.) Such a solution would be REALLY expensive.

  46. Anonymous Coward
    Anonymous Coward

    I had a 40 year career diagnosing apparently random IT problems that people said "couldn't happen". Electrical noise; physical noise; static; "in spec" voltages that weren't tight enough in all cases; cosmic particles; the sun shining. English Electric had to ban salted peanuts from the vending machines because the circuit board production line workers liked them.

    Murphy's Law always applies to any window of opportunity: "If anything can go wrong - it will go wrong". Plus the Sod's Law rider "..at the worst possible time".

    Back in the 1960s my boss had a pragmatic approach about apparent mainframe bugs "Once is not a problem - twice and it becomes a problem".

  47. daveyeager@gmail.com

    Isn't this just another one of those things IBM knew about decades ago? I'm pretty sure they built all these redundancies into their mainframes to catch exactly these types of very rare hardware malfunctions. And now the new kids on the block are like "look what we've discovered".

    Things I want to know: are they becoming more common because more cores mean higher odds that one will go haywire? Or is it because the latest manufacturing nodes are less reliable? Are per-transistor defect rates actually increasing? How much more prevalent are these errors compared to the past, for the same number of cores or servers? Lots of unanswered questions in this article.

  48. anonymous boring coward Silver badge

    "One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."

    Of course Skynet would want us to think it's just accidental...

  49. Jason Hindle

    I see this ending only one way

    A corrupted processor core and your computer attempting to use your smart home (or pacemaker) to murder you in the middle of the night.

  50. Anonymous Coward
    Anonymous Coward

    L3 cache ECC errors

    We've had a mystery case of some (brand new) Intel i7 NUCs suffering correctable and non-correctable level-3-cache ECC errors (giving a Machine Check Error) in the past 18 months.

    Only happens when they're heavily stressed throwing (video) data around.

    Seems to be hardware-specific - you get a "bad batch" which are problematic, showing up errors every few hours, or every few tens of hours.

    Same software runs for hundreds or thousands of hours on other hardware specimens with no issue.

    Related? I don't know.

    1. Claptrap314 Silver badge

      Re: L3 cache ECC errors

      This does sound similar. Seriously, it might be worth a sit-down to talk through, even though it's been 15 years since this was my job. I don't know if you are big enough to merit an account rep with Intel or not, but if you are, be sure to complain--those parts are defective, and need to be replaced by Intel. (And Intel _should_ be pretty interested in getting their hands on some failing examples.)

      1. This post has been deleted by its author

      2. Anonymous Coward
        Anonymous Coward

        Re: L3 cache ECC errors

        We tried talking to Intel but initially they gave us the runaround ("we don't support Linux" etc). (We probably buy a few hundred per year - but growing exponentially.)

        As far as I know we're still effectively "screening" new NUCs by running our code for 48-72 hours, and any that fault in that time are put to one side and not sent to customers. Any that report faults (via our telemetry) in the field will be swapped out at the earliest opportunity.

        Apparently 6 months or so after we started seeing the problem we did get some traction from Intel, sent them a NUC to analyse, and a few months later a BIOS update was issued - which fixed the problem on the NUCs we knew to be flaky at the time. Intel's BIOS release note cryptically says "Fixed issues with Machine Check Errors / Reboot while running game engine".

        I understand we've seen some occasional MCEs since the new BIOS on some specimens, although they may have different root cause...

        1. Claptrap314 Silver badge

          Re: L3 cache ECC errors

          Ouch, ouch, and OUCH!

          If you are seeing a steady problem while ordering less than a thousand CPUs, then this is a HUGE deal. A 1/1000 escape should stop the line. Seriously, talk to your management about going to the press with this.

          BECAUSE you are way, way too small for Intel to care about. And... Intel's parts are everywhere. This isn't going to just affect gamers and miners.

          As to what you can do in the mean time, a couple of things come to mind immediately, sorry if you are already going there.

          1) Double errors happen roughly at the square of the rate of single errors. Screen on corrected bits over a certain level rather than waiting for the MCE (see the sketch below, after point 3c).

          2) Check that you are actually running the parts in spec. I know this sounds insane, but if your workload drives the temperature of the core too high, then you are running it out of spec. Of course, manufacturers do significant work to predict what the highest workload-driven effect can be, but don't trust it. Also, read the specs on the temperature sensors on the part very carefully. Running the core at X temp is not the same as having the sensors report X temp.

          3) It sounds like it would be worthwhile to spend some time reducing your burn-in run. (I come from the manufacturer's side of things, so the economics are WAY different than I'm used to, but still...)

          a) Part of why I wanted you to check the temp spec so carefully is that if you are sure that you are in spec, you might be able to run at a slightly higher ambient temp & still be in spec. Why would you want to do this? Because the fails will happen faster if you do.

          b) Try to identify which part(s) of your workload are triggering the fails, and just run that part over & over. I had test code that could trigger the 750 Medal of Honor bug after 8-10 hours. Eventually, I got it to fail in 1 second.

          c) Try to see if there is a commonality to the memory locations that fail. As I mentioned with the Nintendo bug for the 750, it might be possible to target just a handful of cache lines to activate the failure.
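          On the screening idea in (1), a minimal sketch of the sort of check I mean, assuming a Linux box where the kernel's EDAC counters are exposed (the sysfs paths and the threshold are assumptions, not anything from Intel):

          from pathlib import Path

          # Corrected-error counters from the Linux EDAC subsystem; the exact
          # layout depends on the platform driver, so treat paths as assumptions.
          EDAC_ROOT = Path("/sys/devices/system/edac/mc")
          CE_THRESHOLD = 10   # arbitrary screening threshold, not a vendor figure

          def corrected_error_count():
              if not EDAC_ROOT.exists():
                  raise RuntimeError("EDAC not available on this machine")
              return sum(int(p.read_text()) for p in EDAC_ROOT.glob("mc*/ce_count"))

          ce = corrected_error_count()
          if ce > CE_THRESHOLD:
              print(f"FAIL screening: {ce} corrected errors - pull this unit")
          else:
              print(f"OK so far: {ce} corrected errors")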

  51. BPontius

    Mercurial - so basically predictable human behaviour, then. Imagine the problems that will crop up if/when quantum computers reach that sort of complexity: computing errors within a qubit holding multiple values before its superposition collapses, plus the possibility of quantum entanglement between qubits.
