Somehow this will all end with Intel having part of the blame...
FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof
Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …
COMMENTS
-
Friday 4th June 2021 08:57 GMT imanidiot
Since some parts of modern x86 and x64 chip designs result directly or indirectly from decisions Intel has made in the past, it's likely at least some small part of the blame will lie with Intel. Whether they should have known better (like with the whole IME and predictive threading debacle) remains to be seen.
-
Friday 4th June 2021 12:55 GMT Cynic_999
Re: Allow one to disable a core
But that assumes that an error in a particular core has been detected in the first place. Where errors would have serious consequences, I would say the best policy would be to have at least 2 different cores or CPUs running the same code in parallel, and checking that the results match. Using 3 cores/CPUs would be better and allow the errant device to be detected. Best for each to have their own RAM as well.
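In rough Python terms the idea might look something like this minimal sketch (the toy workload, the process pool and the voting helper are all illustrative, not from the comment):

    # Toy triple redundancy: run the same pure calculation three times in
    # separate processes and only accept a result that at least two agree on.
    # A real deployment would pin each worker to its own core/CPU and RAM.
    from collections import Counter
    from multiprocessing import Pool

    def workload(n):
        # Stand-in for the critical calculation.
        return sum(i * i for i in range(n))

    def vote(results):
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no two results agree: %r" % (results,))
        suspects = [i for i, r in enumerate(results) if r != value]
        return value, suspects      # 'suspects' identifies the errant device

    if __name__ == "__main__":
        with Pool(processes=3) as pool:
            results = pool.map(workload, [1_000_000] * 3)
        print(vote(results))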
-
Sunday 6th June 2021 20:30 GMT Malcolm 5
Re: Allow one to disable a core
That comment about encryption that can only be reversed by the same core intrigued me - maybe I'm overthinking it, but is anything other than XOR encryption really so symmetric that a processor bug would hit encryption and decryption "the same"?
I guess it could be that a stream-cipher calculation was doing the same thing each time and feeding into an XOR with the data.
-
Friday 4th June 2021 01:06 GMT Richard Boyce
Error detection
We've long had ECC RAM available, but only really critical tasks have had CPU redundancy for detecting and removing errors. Maybe it's time for that to change. As chips have more and more cores added, perhaps we could usefully use an option to tie cores together in threes to do the same tasks with majority voting to determine the output.
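At the bit level, the 2-of-3 voter such a scheme relies on is tiny; a purely illustrative sketch:

    def majority3(a: int, b: int, c: int) -> int:
        # Bitwise 2-of-3 majority: each output bit takes whichever value at
        # least two of the three inputs agree on (the classic TMR voter).
        return (a & b) | (b & c) | (a & c)

    assert majority3(0b1010, 0b1010, 0b0110) == 0b1010   # odd one out loses
    assert majority3(0b1111, 0b0111, 0b0011) == 0b0111   # voting is per bit

The voter itself then becomes the part you have to trust, which is one reason real designs keep it as simple as possible.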
-
Friday 4th June 2021 11:46 GMT juice
Re: Error detection
>As in minority report...
I think that "voting" concept has been used in a few places - including, if memory serves, the three "Magi" in Neon Genesis Evangelion.
https://wiki.evageeks.org/Magi
There's even a relatively obscure story about a Bolo (giant sentient tanks), in which the AI's multi-core hardware is failing, and it has to bring a human along for the ride while fighting aliens, since there's a risk that it'll end up stuck with an even number of "votes" and will need to ask the human to act as a tie-breaker...
-
Friday 4th June 2021 19:09 GMT bombastic bob
Re: Error detection
Without revealing [classified information]: the concept of "2 out of 3" needed to initiate something, such as [classified information], might even use an analog means of doing so, and it pre-dates the space shuttle [and Evangelion] by more than just a few years.
Definitely a good idea for critical calculations, though.
-
Friday 4th June 2021 22:28 GMT martinusher
Re: Error detection
Two out of three redundancy is as old as the hills. It can be made a bit more reliable by having different systems arrive at the result -- instead of three (or more) identical boxes you distribute the work among different systems so that the likelihood of an error showing up in more than one system is minimized.
The problem with this sort of approach is not just bulk but time -- like any deliberative process you have to achieve a consensus to do anything which inevitably delays the outcome.
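A toy illustration of the "different routes to the same answer" idea (the workload is mine, not from the comment):

    # Diverse redundancy: the same quantity computed two independent ways and
    # acted on only once they agree -- hence the consensus delay noted above.
    def sum_of_squares_iterative(n: int) -> int:
        return sum(i * i for i in range(1, n + 1))

    def sum_of_squares_closed_form(n: int) -> int:
        return n * (n + 1) * (2 * n + 1) // 6   # independent derivation

    def checked_sum_of_squares(n: int) -> int:
        a = sum_of_squares_iterative(n)
        b = sum_of_squares_closed_form(n)
        if a != b:
            raise RuntimeError("disagreement: %d vs %d" % (a, b))
        return a

A fault that corrupts one route is less likely to corrupt the other in exactly the same way, which is the point of diversifying.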
-
Friday 4th June 2021 22:30 GMT General Purpose
Re: Error detection
Something like this?
During time-critical mission phases (i.e., recovery time less than one second), such as boost, reentry, and landing, four of these computers operate as a redundant set, receiving the same input data, performing the same flight-critical computations, and transmitting the same output commands. (The fifth computer performs non-critical computations.) In this mode of operation, comparison of output commands and “voting” on the results in the redundant set provide the basis for efficient detection and identification of two flight-critical computer failures. After two failures, the remaining two computers in the set use comparison and self-test techniques to provide tolerance of a third fault.
-
Friday 4th June 2021 08:00 GMT cyberdemon
Re: Error detection
Nah, they'll just hide the errors under layer upon inscrutable layer of neural network, and a few arithmetic glitches will probably benefit the model as a whole.
So instead of being a function of its input and training data, and coming to a conclusion like "black person == criminal", it will say something like "bork bork bork, today's unperson of the day is.. Richard Buttleoyce"
-
Friday 4th June 2021 19:22 GMT bombastic bob
Re: Error detection
I dunno about half speed... but certainly limit the operating temperature.
More than likely it's caused by running at higher-than-average temperatures (that are still below the limit), which causes an increase in hole/electron migration within the gates [from entropy], and they become weakened and occasionally malfunction...
(at higher temperatures, entropy is higher, and therefore migration as well)
I'm guessing that these malfunctioning devices had been run at very high temperatures, almost continuously, for a long period of time [years even]. Even though the chip spec allows temperatures to be WAY hotter than they usually run at, it's probably not a good idea to LET this happen in order to save money on cooling systems (or for any other reason related to this).
On several occasions I've seen overheated devices malfunction [requiring replacement]. In some cases it was due to bad manufacturing practices (an entire run of bad boards with dead CPUs). I would expect that repeated exposure to maximum temperatures over a long period of time would eventually have the same effect.
-
Friday 4th June 2021 15:39 GMT Anonymous Coward
Re: Error detection
That is a lot of silicon being dedicated to a problem that can be solved with less.
It's possible, and it has been done where I used to work (Sussex uni), to implement error checking in logic gates. Around 20 years ago a researcher there was generating chip designs for error checking, finding the fewest gates needed and shrinking the designs that were current at the time.
Back then he was using a GA to produce the needed layouts, and he found many that were more efficient than those in use (and provided them free to use). This could be applied to CPUs, and is for critical systems, but since it uses more silicon it isn't done in consumer CPUs: that adds cost for no performance gain.
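For flavour, a far simpler error-checking circuit than the GA-found designs: even parity over a four-bit word, written out as XOR "gates" (illustrative only):

    def parity_bit(b3, b2, b1, b0):
        # Three 2-input XOR gates in hardware.
        return b3 ^ b2 ^ b1 ^ b0

    def parity_ok(b3, b2, b1, b0, p):
        # Any single flipped bit makes the overall XOR non-zero.
        return (b3 ^ b2 ^ b1 ^ b0 ^ p) == 0

    word = (1, 0, 1, 1)
    p = parity_bit(*word)
    assert parity_ok(*word, p)
    assert not parity_ok(1, 0, 1, 0, p)   # one bit flipped -> detected

The research mentioned above was about finding minimal gate arrangements for this sort of checking; the principle is the same either way: spend a few extra gates to notice when the answer has gone wrong.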
-
Friday 4th June 2021 16:26 GMT EveryTime
Re: Error detection
CPU redundancy has been around almost since the beginning of electronic computing, but it largely disappeared in the early 1990s as caching and asynchronous interrupts made cycle-by-cycle comparison infeasible.
My expectation is that this will turn out to be another in a long history of misunderstanding faults. It's seeing a specific design error and mistaking it for a general technology limit.
My first encounter with this was when dynamic RAM was suffering from high fault rates. I read many stories on how the limit of feature size had been reached. The older generation had been reliable, so the speculation was that the new, smaller memory capacitors had crossed the threshold where every cosmic ray would flip bits. I completely believed those stories. Then the next round of stories reported that the actual problem was the somewhat radioactive ceramic used for the chip packaging. Switching to a different source of ceramic avoided the problem, and it was a motivation to simply change to less expensive plastic packages.
The same thing happened repeatedly over the years in supercomputing/HPC. Researchers thought that they spotted disturbing trends in the largest installed systems. What they found was always a specific solvable problem, not a general reliability limit to scaling.
-
Friday 4th June 2021 16:55 GMT Warm Braw
Re: Error detection
The approach adopted by Tandem Computers was to duplicate everything, including memory and persistent storage, as you can get bus glitches, cache glitches and all sorts of other transient faults in "shared" components which you would not otherwise be able to detect simply from core coupling. But even that doesn't necessarily protect against systematic errors where every instance of (say) the processor makes the same mistake repeatably.
It's a difficult problem: and don't forget that many peripherals will also have processors in them, it's not just the main CPU you have to look out for.
-
Friday 4th June 2021 19:04 GMT bombastic bob
Re: Error detection
CPU redundancy may be easier than people may want to admit...
If your CPU has multiple (actual) cores, for "critical" operations you could run two parallel threads. If your threads can be assigned "CPU affinity" such that they don't hop from CPU to CPU as tasks switch around then you can compare the results to make sure they match. If you're REALLY paranoid, you can use more than 2 threads.
If it's a VM then the hypervisor (or emulator, or whatever) would need to be able to ensure that core to thread affinity is supported.
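On Linux, a rough sketch of the pin-and-compare idea (the core numbers, workload and queue plumbing are illustrative; separate processes are used instead of threads purely to keep the sketch short):

    import os
    from multiprocessing import Process, Queue

    def pinned_worker(core, n, out):
        os.sched_setaffinity(0, {core})   # keep this worker on one core (Linux-only)
        out.put((core, sum(i * i for i in range(n))))

    if __name__ == "__main__":
        out = Queue()
        procs = [Process(target=pinned_worker, args=(core, 1_000_000, out))
                 for core in (0, 1)]      # use more cores if REALLY paranoid
        for p in procs:
            p.start()
        results = dict(out.get() for _ in procs)   # collect before joining
        for p in procs:
            p.join()
        if len(set(results.values())) != 1:
            print("mismatch between cores:", results)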
-
Sunday 6th June 2021 21:59 GMT Anonymous Coward
Re: Error detection and elimination
"only really critical tasks have had CPU redundancy for detecting and removing errors. "
Tandem Nonstop mean anything to you?
Feed multiple nominally identical computer systems the same set of inputs and if they don't have the same outputs something's gone wrong (massively oversimplified).
Lockstep at I/O level rather than at instruction level (how does instruction-level lockstep deal with things like soft errors in cache memory, which can be corrected but are unlikely to occur simultaneously on two or more systems being compared?).
Anyway, it's mostly been done before. Just not by the Intel/Windows world.
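A sketch of what checking at the I/O boundary (rather than per instruction) might look like, with each replica reduced to a function for brevity; in reality each replica would be a separate machine fed the same inputs:

    import hashlib

    def run_replica(inputs):
        # Stand-in for a whole system processing the same inputs.
        return "\n".join(str(x * x) for x in inputs).encode()

    def commit_if_agreed(inputs, replicas=2):
        outputs = [run_replica(inputs) for _ in range(replicas)]
        digests = {hashlib.sha256(o).hexdigest() for o in outputs}
        if len(digests) != 1:
            raise RuntimeError("replica outputs diverge at the I/O boundary")
        return outputs[0]   # only now hand the data to the real output device

    print(commit_if_agreed([1, 2, 3]))

Corrected soft errors inside one replica never show up in this comparison, because only what each system actually tries to write gets checked.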
-
Monday 7th June 2021 11:00 GMT Tom 7
Re: Error detection
Could cause more problems than it solves. If all three cores are close to each other on the die and the error is of the 'field' type (where lots of a certain kind of activity in a certain area of the chip causes the problem), then all three cores could fall foul of the same problem and produce identical incorrect results, giving the illusion that all is OK.
-
Friday 4th June 2021 14:39 GMT Irony Deficient
Maybe the Google boffins can learn a few techniques from them.
Ximénez: Now, old woman — you are accused of heresy on three counts: heresy by thought, heresy by word, heresy by deed, and heresy by action — four counts. Do you confess?
Wilde: I don’t understand what I’m accused of.
Ximénez: Ha! Then we’ll make you understand! Biggles! Fetch … the cushions!