Somehow this will all end with Intel having part of the blame...
FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof
Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …
COMMENTS
-
-
Friday 4th June 2021 08:57 GMT imanidiot
Since some parts of modern x86 and x64 chip designs result directly or indirectly from decisions Intel has made in the past, it's likely at least some small part of the blame will lie with Intel. Whether they should have known better (like with the whole IME and predictive threading debacle) remains to be seen.
-
-
Friday 4th June 2021 12:55 GMT Cynic_999
Re: Allow one to disable a core
But that assumes that an error in a particular core has been detected in the first place. Where errors would have serious consequences, I would say the best policy would be to have at least 2 different cores or CPUs running the same code in parallel, and checking that the results match. Using 3 cores/CPUs would be better and allow the errant device to be detected. Best for each to have their own RAM as well.
-
-
Sunday 6th June 2021 20:30 GMT Malcolm 5
Re: Allow one to disable a core
That comment about encryption that can only be reversed by the same core intrigued me - maybe I am thinking too advanced, but is anything other than XOR encryption really so symmetric that a processor bug would hit encryption and decryption "the same"?
I guess it could be that there was a stream calculation doing the same thing each time and feeding into an XOR with the data.
-
-
-
-
-
Friday 4th June 2021 01:06 GMT Richard Boyce
Error detection
We've long had ECC RAM available, but only really critical tasks have had CPU redundancy for detecting and removing errors. Maybe it's time for that to change. As chips have more and more cores added, perhaps we could usefully use an option to tie cores together in threes to do the same tasks with majority voting to determine the output.
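A minimal sketch of that majority-vote idea in Python (illustrative only: the three runs happen sequentially here rather than pinned to three separate cores, and the task is just a stand-in):

from collections import Counter

def run_with_voting(task, *args):
    # Run the same task three times and take the majority answer.
    results = [task(*args) for _ in range(3)]
    answer, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no two results agree: %r" % (results,))
    if votes == 2:
        print("warning: one run disagreed - flag that core for testing")
    return answer

print(run_with_voting(lambda x: x * x, 12345))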
-
-
-
-
Friday 4th June 2021 11:46 GMT juice
Re: Error detection
>As in minority report...
I think that "voting" concept has been used in a few places - including, if memory serves, the three "Magi" in Neon Genesis Evangelion.
https://wiki.evageeks.org/Magi
There's even a relatively obscure story about a Bolo (giant sentient tanks), in which the AI's multi-core hardware is failing, and it has to bring a human along for the ride while fighting aliens, since there's a risk that it'll end up stuck with an even number of "votes" and will need to ask the human to act as a tie-breaker...
-
-
Friday 4th June 2021 19:09 GMT bombastic bob
Re: Error detection
without revealing [classified information] the concept of "2 out of 3" needed to initiate something, such as [classified information], might even use an analog means of doing so, and pre-dates the space shuttle [and Evangelion] by more than just a few years.
Definitely a good idea for critical calculations, though.
-
Friday 4th June 2021 22:28 GMT martinusher
Re: Error detection
Two out of three redundancy is as old as the hills. It can be made a bit more reliable by having different systems arrive at the result -- instead of three (or more) identical boxes you distribute the work among different systems so that the likelihood of an error showing up in more than one system is minimized.
The problem with this sort of approach is not just bulk but time -- like any deliberative process you have to achieve a consensus to do anything which inevitably delays the outcome.
-
-
Friday 4th June 2021 22:30 GMT General Purpose
Re: Error detection
Something like this?
During time-critical mission phases (i.e., recovery time less than one second), such as boost, reentry, and landing, four of these computers operate as a redundant set, receiving the same input data, performing the same flight-critical computations, and transmitting the same output commands. (The fifth computer performs non-critical computations.) In this mode of operation, comparison of output commands and “voting” on the results in the redundant set provide the basis for efficient detection and identification of two flight-critical computer failures. After two failures, the remaining two computers in the set use comparison and self-test techniques to provide tolerance of a third fault.
-
-
-
-
-
-
-
Friday 4th June 2021 08:00 GMT cyberdemon
Re: Error detection
Nah, they'll just hide the errors under layer upon inscrutable layer of neural network, and a few arithmetic glitches will probably benefit the model as a whole.
So instead of being a function of its input and training data, and coming to a conclusion like "black person == criminal" it will say something like "bork bork bork, today's unperson of the day is.. Richard Buttleoyce"
-
-
-
Friday 4th June 2021 19:22 GMT bombastic bob
Re: Error detection
I dunno about half speed... but certainly limit the operating temperature.
More than likely it's caused by running at higher than average temperatures (that are still below the limit), which causes an increase in hole/electron migration within the gates [from entropy], and they become weakened and occasionally malfunction...
(at higher temperatures, entropy is higher, and therefore migration as well)
I'm guessing that these malfunctioning devices had been run at very high temperatures, almost continuously, for a long period of time [years even]. Even though the chip spec allows temperatures to be WAY hotter than they usually run at, it's probably not a good idea to LET this happen in order to save money on cooling systems (or for any other reason related to this).
On several occasions I've seen overheated devices malfunction [requiring replacement]. In some cases it was due to bad manufacturing practices (an entire run of bad boards with dead CPUs). I would expect that repeated exposure to maximum temperatures over a long period of time would eventually have the same effect.
-
-
-
Friday 4th June 2021 15:39 GMT Anonymous Coward
Re: Error detection
That is a lot of silicon being dedicated to a problem that can be solved with less.
It's possible to implement error checking in logic gates, and it has been done where I used to work, Sussex uni. A researcher there around 20 years ago was generating chip designs for error checking, finding the smallest number of gates needed and reducing the then-current design size.
Back then he was using a GA to produce the needed layouts, and he found many more efficient ones than were in use at the time (and provided them free to use). This could be applied to CPUs, and is for critical systems, but as it uses more silicon it isn't done in consumer CPUs, since that adds cost for no performance gain.
-
Friday 4th June 2021 16:26 GMT EveryTime
Re: Error detection
CPU redundancy has been around almost since the beginning of electronic computing, but it largely disappeared in the early 1990s as caching and asynchronous interrupts made cycle-by-cycle comparison infeasible.
My expectation is that this will turn out to be another in a long history of misunderstanding faults. It's seeing a specific design error and mistaking it for a general technology limit.
My first encounter with this was when dynamic RAM was suffering from high fault rates. I read many stories on how the limit of feature size had been reached. The older generation had been reliable, so the speculation was that the new smaller memory capacitors had crossed the threshold where every cosmic ray would flip bits. I completely believed those stories. Then the next round of stories reported that the actual problem was the somewhat radioactive ceramic used for the chip packaging. Switching to a different source of ceramic avoided the problem, and it was a motivation to simply change to less expensive plastic packages.
The same thing happened repeatedly over the years in supercomputing/HPC. Researchers thought that they spotted disturbing trends in the largest installed systems. What they found was always a specific solvable problem, not a general reliability limit to scaling.
-
Friday 4th June 2021 16:55 GMT Warm Braw
Re: Error detection
The approach adopted by Tandem Computers was to duplicate everything, including memory and persistent storage, as you can get bus glitches, cache glitches and all sorts of other transient faults in "shared" components which you would not otherwise be able to detect simply from core coupling. But even that doesn't necessarily protect against systematic errors where every instance of (say) the processor makes the same mistake repeatably.
It's a difficult problem: and don't forget that many peripherals will also have processors in them; it's not just the main CPU you have to look out for.
-
Friday 4th June 2021 19:04 GMT bombastic bob
Re: Error detection
CPU redundancy may be easier than people may want to admit...
If your CPU has multiple (actual) cores, for "critical" operations you could run two parallel threads. If your threads can be assigned "CPU affinity" such that they don't hop from CPU to CPU as tasks switch around then you can compare the results to make sure they match. If you're REALLY paranoid, you can use more than 2 threads.
If it's a VM then the hypervisor (or emulator, or whatever) would need to be able to ensure that core to thread affinity is supported.
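On Linux this can be sketched from user space with process affinity; a rough illustration (assumes a Linux box with at least two cores, and the calculation is just a stand-in):

import os
from multiprocessing import Pool

def run_pinned(args):
    core, n = args
    os.sched_setaffinity(0, {core})       # pin this worker to one core (Linux-specific)
    return sum(i * i for i in range(n))   # stand-in for the "critical" calculation

if __name__ == "__main__":
    cores = [0, 1]
    with Pool(len(cores)) as pool:
        results = pool.map(run_pinned, [(c, 1_000_000) for c in cores])
    if len(set(results)) != 1:
        raise RuntimeError("cores disagree: %r" % (results,))
    print("results match:", results[0])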
-
Sunday 6th June 2021 21:59 GMT Anonymous Coward
Re: Error detection and elimination
"only really critical tasks have had CPU redundancy for detecting and removing errors. "
Tandem Nonstop mean anything to you?
Feed multiple nominally identical computer systems the same set of inputs and if they don't have the same outputs something's gone wrong (massively oversimplified).
Lockstep at IO level rather than instruction level (how does instruction-level lockstep deal with things like soft errors in cache memory, which can be corrected but are unlikely to occur simultaneously on two or more systems being compared?).
Anyway, it's mostly been done before. Just not by the Intel/Windows world.
-
Monday 7th June 2021 11:00 GMT Tom 7
Re: Error detection
Could cause more problems than it solves. If all three cores are close to each other on the die and the error is one of the 'field type' (where lots of certain activity in a certain area of the chip causes the problem), then all three cores could hit the same problem and provide identical incorrect results, giving the illusion that all is OK.
-
-
-
Friday 4th June 2021 14:39 GMT Irony Deficient
Maybe the Google boffins can learn a few techniques from them.
Ximénez: Now, old woman — you are accused of heresy on three counts: heresy by thought, heresy by word, heresy by deed, and heresy by action — four counts. Do you confess?
Wilde: I don’t understand what I’m accused of.
Ximénez: Ha! Then we’ll make you understand! Biggles! Fetch … the cushions!
-
Friday 4th June 2021 07:37 GMT Ken Moorhouse
Complexity: Another nail in the coffin...
...for the cloud.
Before anyone posts the obvious rebuttal: note this phrase "two of the world's larger CPU stressors, Google and Facebook".
If your critical business processes are on-prem, the chances are that you will not be stressing your CPUs to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time.
I grew up with the concept of "the clock pulse". If we're pushing synchronous data to the limits (rise-fall time of data wrt clock pulses) then you could arguably get a skew effect. If designers are in denial about that then there are big problems ahead. (Rowhammer is a related problem).
-
Friday 4th June 2021 10:23 GMT Brewster's Angle Grinder
Not all electrons are made equal...
To me this sounds like quantum effects. No manufacturing process produces exact replicas; there is going to be subtle variation between chips. I don't know anything about modern chip design and manufacture so can't speculate what it could be. But electron behaviour is just the law of averages. And so whatever these defects are, it means electrons can periodically jump where they shouldn't. The smaller the currents, the fewer party* electrons are needed for this to become significant.
* The party number is one of the important quantum numbers. It determines how likely an electron is to be an outlier. It's normally represented as a mullet.
-
-
Friday 4th June 2021 12:42 GMT Brewster's Angle Grinder
Forbidden gates
But these are process variations that are being missed by manufacturers and where the chip generally functions as required. Just every once in a while it goes haywire. You could call it fate. You could call it luck. You could call it Karma. You could say it's mercurial or capricious. Or you could suspect some process variation allows tunnelling with low probability, or that some other odd transition or excitation is happening.
-
Friday 4th June 2021 17:18 GMT Anonymous Coward
Re: Forbidden gates
It's just down to the statistics of very rare events with very large N. If you have a reliable processor with a clock speed of 10^9 hertz that gives you just one error every 10^20 clocks, then you can expect an error every 3000 years or so, say a one in five hundred or a thousand chance of seeing a single error during the 3-6 year life of the system. I can live with that for my laptop.
But if you buy a million of those processors and run them in parallel in data centres then you will see roughly an error every day.
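The back-of-the-envelope arithmetic, using the rates assumed above (not measured data):

clock_hz = 1e9             # 1 GHz
errors_per_clock = 1e-20   # one error every 10^20 cycles (assumed)
seconds_per_year = 3600 * 24 * 365

years_per_error = 1 / (clock_hz * errors_per_clock) / seconds_per_year
print("one CPU: roughly one error every %.0f years" % years_per_error)      # ~3000 years

fleet = 1_000_000
days_per_error = years_per_error * 365 / fleet
print("a fleet of %d CPUs: roughly one error every %.1f days" % (fleet, days_per_error))  # ~1 day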
-
Friday 4th June 2021 22:41 GMT General Purpose
Re: Forbidden gates
The trouble is that those errors aren't evenly spread. Specific individual cores go bad. The chances are against you having one of those in your laptop or one of your on-premises servers, but if you do have one then you may experience a series of mysterious crashes, incorrect calculations and/or data loss, not just one incident.
-
-
-
-
Friday 4th June 2021 11:23 GMT Cuddles
Re: Complexity: Another nail in the coffin...
"If your critical business processes are on-prem, the chances are that you will not be stressing your CPU's to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time."
I don't think it's anything to do with CPU time, but simply the number of CPUs. As the article notes, it's a few problematic cores per several thousand CPUs, ie. it's not random failures due to the large amount of use, it's some specific cores that have a problem. But since the problems are rare, only people operating many thousands of them are likely to actually encounter them. So it's a bit misleading to call them "stressors" of CPUs; it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs.
So it's hard to say if on-prem would be better or not. On the one hand, you're unlikely to have enough CPUs to actually have a problem. But if you get unlucky and you do, the problematic core will be a greater percentage of your computing, and you're unlikely to be able to actually spot it at all. On the other hand, being assigned different CPUs every time you run a task in the cloud makes it almost inevitable that you'll encounter a troublesome core at some point. But it's unlikely to be a persistent problem since you won't have the same core next time, and the companies operating at that scale are able to assign the resources to actually find the problem.
-
-
Friday 4th June 2021 18:49 GMT Jon 37
Re: Complexity: Another nail in the coffin...
No, because of the way crypto is designed. Any miner who tries to submit a mined block will have it tested by every other node on the network. If the miner's system glitched, then the block just won't be accepted. And this sounds rare enough that a miner would just shrug and move on to the next block.
-
-
-
Friday 4th June 2021 18:58 GMT Jon 37
Re: re: Cuddles: Complexity: Another nail in the coffin...
"Stream ciphers", one of the common kinds of encryption algorithm, work by taking a key and generating a long string of pseudo-random numbers from that key. That then gets XOR'd into the data.
It's the same algorithm to encrypt and to decrypt. (Like how ROT13 is the same algorithm to encrypt and to decrypt, except a lot more secure).
So it's certainly possible that a core bug results in the specific sequence of instructions in the pseudo-random-number generator part giving the wrong answer. And it's certainly possible that is reproducible, repeating it with the same key gives the same wrong answer each time.
That would lead to the described behaviour - encrypting on the buggy core gives a different encryption from any other core, so only the buggy core can decrypt it.
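A toy illustration of why encrypt and decrypt are the same operation for a stream cipher (a demonstration only, using Python's random module as the keystream generator rather than a real cipher):

import random

def keystream(key, n):
    rng = random.Random(key)        # deterministic pseudo-random stream derived from the key
    return bytes(rng.randrange(256) for _ in range(n))

def stream_xor(key, data):
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

msg = b"attack at dawn"
ct = stream_xor(42, msg)            # encrypt
assert stream_xor(42, ct) == msg    # the very same call decrypts

# A core that computed the keystream wrongly but reproducibly would still
# round-trip its own output - and no other core could read it.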
-
Friday 4th June 2021 19:35 GMT bombastic bob
Re: re: Cuddles: Complexity: Another nail in the coffin...
maybe they need to use an encryption algorithm that isn't susceptible to (virtually) identical math errors during encryption and decryption. Then you could self-check by decrypting the encrypted output and comparing to the original. So long as the errors produce un-decryptable results, you should be fine.
-
-
Friday 4th June 2021 17:21 GMT Michael Wojcik
Re: Complexity: Another nail in the coffin...
it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs
Well, it's also about how much of the time a given CPU (or rather each of its cores) is being used, since that's what gives you a result that might be incorrect. If a company "uses" a million cores but a given core is idle 90% of the time, they'll be much less likely to encounter a fault, obviously.
So while "stressing" is probably not really an accurate term – it's not like they're using the CPUs outside their documented envelope (AFAIK) – "using more or less constantly" is a relevant qualification.
-
-
Friday 4th June 2021 07:45 GMT amanfromMars 1
Just the cost of doing such business.
The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests.
If you consider chips as not too dissimilar from the networking of smarter humans, emerging anomalies are much easier to understand and be prepared for and accepted as just being an inherent endemic glitch always testing novel processes and processing there is no prior programming for.
And what if they are not simply errors but other possibilities available in other realities/times/spaces/virtually augmented places?
Are we struggling to make machines more like humans when we should be making humans more like machines….. IntelAIgent and CyberIntelAIgent Virtualised Machines?
Prime Digitization offers Realisable Benefits.
What is a computer other than a machine which we try to make think like us and/or for us? And what other model, to mimic/mirror could we possibly use, other than our own brain or something else SMARTR imagined?
And if through Deeper Thought, our Brain makes a Quantum Leap into another Human Understanding such as delivers Enlightened Views, does that mean that we can be and/or are Quantum Computers?
And is that likely to be a Feared and/or AWEsome Alien Territory?
-
Friday 4th June 2021 08:34 GMT Anonymous Coward
Once upon a time.....way back in another century......
......some of us (dimly) remember the idea of a standard development process:
1. Requirements (how quaint!!!)
2. Development
3. Unit Test
4. Functional Test
5. Volume Test (also rather quaint!!)
6. User Acceptance Test (you know...against item#1)
.....where #4, #5 and #6 might overlap somewhat in the timeline.
Another old fashioned idea was to have two (or three) separate installations (DEV, USER, PROD).......
......not sure how any of this old fashioned, twentieth century thinking fits in with "agile", "devops", "cloud"....and other "advanced" twenty first century thinking.
......but this article certainly makes this AC quite nostalgic for days past!
-
Friday 4th June 2021 11:28 GMT Anonymous South African Coward
Re: Once upon a time.....way back in another century......
In the Elder Days, when things was Less Rushed, sure, you could take your time with a product, and deliver a product that lived up to its promises.
Nowadays in these Younger Days everything is rushed to market (RTM) after a vigorous spit 'n polish and sugarcoating session to hide most of Them Nasteh Buggreh Bugs. And nary a peep of said TNBBs either... hoping said TNBBs won't manifest themselves until closer to the End Lifetime of the Product.
Case in point - MCAS.
ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.
-
Friday 4th June 2021 12:35 GMT Version 1.0
Re: Once upon a time.....way back in another century......
I never saw any problems with an 8080, 8085, 8048, or Z80 that I didn't create myself and fix as soon as I saw the problem. Processors used to be completely reliable until the marketing and sales department started wanting to add "features", which have led to all of today's issues.
-
Friday 4th June 2021 15:07 GMT Arthur the cat
Re: Once upon a time.....way back in another century......
ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.
On the other hand, back when, a friend of mine remarked that the Sinclair Scientific calculator was remarkably egalitarian, because if you didn't like the answer it gave you, you just had to squeeze the sides and it would give you a different one.
-
-
Friday 4th June 2021 14:53 GMT Primus Secundus Tertius
Re: Once upon a time.....way back in another century......
AC has described the ideal case.
In practice, there were repeats of item 1 between items 2 and 3, 3 and 4, etc. Table-thumping customer managements and toadying contractor sales people.
(S)He also omits a necessary step between 1 and 2, namely the software design. The requirements stated what was thought to be required - not always a correct piece of analysis. The software design says how you get there in terms of data structures and algorithms. Once software got past transcribing maths into FORTRAN the SD was essential.
For CPUs, replace software with microcode. This was even more problematical than orthodox code.
-
Monday 7th June 2021 01:39 GMT Doctor Syntax
Re: Once upon a time.....way back in another century......
7. Use in production.
It's only in 7, and even then only at large scale that rare, sporadic failures become recognisable. Even if you were lucky enough to catch one at the previous stages you wouldn't be able to reproduce it reliably enough to understand it.
-
Friday 4th June 2021 08:41 GMT Pascal Monett
"misbehaving cores"
Is the solution of replacing the CPU with another, identical one not a good idea, or will the new one start misbehaving in the same way?
The article states that Google and FaceBook report a few cores in a thousand. That means that most CPUs are functioning just fine, so rip out the mercurial CPUs and replace them. That should give a chance of solving the immediate issue.
Of course, then you take the misbehaving CPU and give it a good spanking, euh, put it in a test rig to find out just how it fails.
-
Friday 4th June 2021 08:55 GMT Ken Moorhouse
Re: "misbehaving cores"
The question is whether this is at the CPU level, the board level, the box level or the system level. Tolerances* for all of these things give rise to unacceptable possibilities - don't forget at the board/box level you've got power supplies and, hopefully, UPSs attached to those. How highly do these data centres/centers rate these seemingly mundane sub-assemblies, for example? (I'm sure many of us here have had experiences with slightly wayward PSUs.)
*The old-fashioned "limits and fits" is to my mind a better illustration of how components work with each other.
-
Friday 4th June 2021 11:56 GMT Anonymous Coward
Can we have ECC RAM supported by regular chipsets, please? Like we certainly had off the shelf in the late 90s / early 2000s. The sheer quantity of RAM and reduced tolerance to radiation mean the probability of bitflips is rather greater today than before.
Either AMD or Intel could put support back into consumer chipsets as an easy way to get an edge over competitors.
Regarding CPUs, there's a reason satellite manufacturers are happy using a 20 year old architecture and manufacturing process at 200nm: lower vulnerability to radiation-induced errors. (And using SRAM rather than DRAM too, for the same reason.) Performance, cost, "tolerable" error. Rather less practical to roll back consumer performance (unless you fancy getting some genuinely efficient software out in circulation).
-
Friday 4th June 2021 12:49 GMT dinsdale54
I have worked for a few hardware companies over the years and every single one has at some point had issues with random errors causing system crashes at above designed rates - these were all bit-flip errors.
In each case the people who noticed first were our biggest customers. In one of these cases the way they discovered the problem was products from two different companies exhibiting random errors. A quick look at both motherboards showed the same I/O chipset in use. Radioactive contamination in the chip packaging was the root cause.
You can mitigate these by putting multi-layer parity and ECC on every chip, bus and register with end-to-end checksumming. That will turn silent data corruption into non-silent, but it's also really expensive.
But at least let's have ECC as standard!
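The end-to-end checksumming idea, as a minimal sketch (a toy using CRC32 over a data path, not any particular product's scheme):

import zlib

def protect(payload: bytes) -> bytes:
    # Producer attaches a CRC32 that travels with the data end to end.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(frame: bytes) -> bytes:
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("silent corruption made loud: checksum mismatch")
    return payload

frame = bytearray(protect(b"some block of data"))
frame[3] ^= 0x01        # simulate a bit flipped somewhere along the path
check(bytes(frame))     # raises instead of silently returning bad data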
-
-
Friday 4th June 2021 18:23 GMT dinsdale54
I forget the exact details - this was over 10 years ago - but IIRC systems that had generated these errors were put in a radiation test chamber and radioactivity measured. Once you have demonstrated there's a problem then it's down to the chipset manufacturer to find the issue. I think it was just low level contamination in the packaging material that occasionally popped out an Alpha particle and could flip a bit.
The remediation is a massive PITA. I think we were dealing with it for about 2 years from initial high failure rates to having all the faulty systems replaced.
Over the years I have spent far more of my career dealing with these issues than I would like. I put in a big shift remediating Seagate MOOSE drives that had silent data corruption as well.
-
-
-
Friday 4th June 2021 13:49 GMT Ilsa Loving
Minority report architecture
The only way to be sure would be to have at least 2 cores doing the same calculation each time. If they disagreed, run the calculation again. Alternatively you could have 3 cores doing the same calculation and if there's one core wrong then majority wins.
Or we finally move to a completely new technology like maybe optical chips.
-
-
Friday 4th June 2021 17:03 GMT Persona
Networks can misbehave too
I've seen it happen to network traffic too. With data being sent on a network halfway around the world, occasionally a few bits got changed. The deep analysis showed that interference was hitting part of the route and the network error detection was doing its job, detecting it and getting it resent. Very very very rarely the interference corrupted enough bits to pass the network-level error check.
-
Sunday 6th June 2021 23:18 GMT Anonymous Coward
Re: Networks can misbehave too
One remote comms device was really struggling - and the transmission errors generated lots of new crashes in the network controller. The reason was the customer had the comms cable running across the floor of their arc welding workshop.
A good test of a comms link was to wrap the cable a few times round a hair dryer - then switch it on. No matter how good the CRC - there is always a probability of a particular set of corrupt data passing it.
-
-
Friday 4th June 2021 17:17 GMT Claptrap314
Been there, paid to do that
I did microprocessor validation at AMD & IBM for a decade about two decades ago.
I'm too busy to dig into these papers, but allow me to lay out what this sounds like. AMD never had problems of this sort while I was there (the validation team Dave Bass built was that good--by necessity.)
Many large customers of both companies found it worthwhile to have their own validation teams. Apple in particular had a validation team that was frankly capable of finding more bugs than the IBM team did in the 750 era. (AMD's customers in the 486 & K5 era would tell them about the bugs that they found in Intel's parts & demand that we match them.)
Hard bugs are the ones that don't always happen--you can execute the same stream of instructions multiple times & get different results. This is almost certainly not the case for the "ransomware" bug. This rules out a lot of potential issues, including "cosmic rays" and "the Earth's magnetic field". (No BOFHs.)
The next big question is whether these parts behave like this from the time that they were manufactured, or if they are the result of damage that accumulates during the lifetime of any microprocessor. Variations in manufacturing process can create either of these. We run tests before the dies are cut to catch the first case. For the latter, we do burn-in tests.
My first project at IBM was to devise a manufacturing test to catch a bug reported by Nintendo in about 3/1000 parts (AIR) during the 750 era. They wanted to find 75% of the bad parts. I took a bit longer than they wanted to isolate the bug, but my test came out 100% effective.
My point is that this has always been an issue. Manufacturers exist to make money. Burn-in tests are expensive to create--and even more expensive to run. You can work with your manufacturer about these issues or you can embarrass them. Sounds like F & G are going for the latter.
Oh, and I'm available. ;)
-
Friday 4th June 2021 17:18 GMT pwjone1
Error Checking and modern processor design
To some degree, undetected errors are to be more expected as chip lithography evolves (14nm to 10nm to 7nm to 6 and 5 or 3nm). There is a history of dynamic errors (crosstalk, XI, and other causes), and the susceptibility to these gets worse as the device geometries get smaller -- just fewer electrons that need to leak. Localized heating also becomes more of a problem the denser you get. Obviously Intel has struggled to get to 10nm, potentially also a factor.
But generally x86 (and Atom) processor designs have not had much error checking; the design point is that the cores "just work", and as that has gradually become more and more problematical, it may be that Intel/AMD/Apple/etc. will need to revisit their approach. IBM, on higher end servers (z), generally includes error checking. This is done via various techniques like parity/ECC on internal data paths and caches, predictive error checking (for example, on a state machine, you predict/check the parity or check bits on the next state), and in some cases full redundancy (like laying down two copies of the ALU or cores, comparing the results each cycle). To be optimal, you also need some level of software recovery, and there are varying techniques there, too.
Added error checking hardware also has its costs, generally 10-15% and a bit of cycle time, depending on how much you put in and how exactly it is implemented. So in a way, it is not too much of a surprise that Google (and others) have observed "Mercurial" results; any hardware designer would have shrugged and said "What did you expect? You get what you pay for."
-
Friday 4th June 2021 17:20 GMT Draco
What sort of bloody dystopian Orwellian-tak is this?
"Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data," said Peter Hochschild.
How were the machines disciplined? Were they given a warning? Did the machines take and pass the requisite unconscious bias training courses?
"These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them."
Did the machines have counsel? Were these accusations proven? Can the machines sue for slander and libel if the accusations are shown to be false?
-----------
English is my second language and those statements are truly mind numbing to read. In a certain context, they might be seen as humorous.
What is wrong with the statements being written something more like:
"Our adventure began as vigilant production teams increasingly observed machines corrupting data," said Peter Hochschild.
"These machines were repeatedly observed corrupting multiple different stable well-debugged large-scale applications. Multiple independent teams noted corruptions by these machines even though conventional diagnostics found nothing wrong with them."
-
Saturday 5th June 2021 04:21 GMT amanfromMars 1
Re: What sort of bloody dystopian Orwellian-tak is this?
Nice one, Draco. Have a worthy upvote for your contribution to the El Reg Think Tank.
Transfer that bloody dystopian Orwellian-tak to the many fascist and nationalistic geo-political spheres which mass multi media and terrifying statehoods are responsible for presenting, ...... and denying routinely they be held accountable for ....... and one of the solutions for machines to remedy the situation is to replace and destroy/change and recycle/blitz and burn existing prime established drivers /seditious instruction sets.
Whether that answer would finally deliver a contribution for retribution and recalibration in any solution to the question that corrupts and perverts and subverts the metadata is certainly worth exploring and exploiting any time the issue veers towards destructively problematical and systemically paralysing and petrifying.
Some things are just turned plain bad and need to be immediately replaced, old decrepit and exhausted tired for brand spanking new and tested in new spheres of engagement for increased performance and reassuring reliability/guaranteed stability.
Out with the old and In with the new for a new dawn and welcoming beginning.
-
-
Friday 4th June 2021 17:29 GMT Claptrap314
Buggy processors--that work!
The University of Michigan made a report (with the above title) around 1998 regarding research that they had done with a self-checking microprocessor. Their design was to have a fully out-of-order processor do the computations and compare each part of the computation against an in-order core that was "led" by the out-of-order design. (The entire in-order core could be formally validated.) When there was a miscompare, the instruction would be re-run through the in-order core without reference to the "leading" of the out-of-order core. In this fashion, bugs in the out-of-order core would be turned into slowdowns.
In order for the design to function appropriately, the in-order core required a level-0 cache. The result was that overall execution speed actually increased slightly. (AIR, this was because the out-of-order core was permitted to fetch from the L0 cache.)
The design did not attract much attention at AMD. I assume that was mostly because our performance was so much beyond what our team believe this design could reach.
Sadly, such a design does nothing to block Spectre-class problems.
In any event, the final issue is cost. F & G are complaining about how much cost they have to bear.
-
-
Monday 7th June 2021 23:58 GMT Claptrap314
Re: Buggy processors--that work!
You would think so, wouldn't you?
The design broke up an instruction into parts--instruction fetch, operand fetch, result computation, result store. (It's been >20 years--I might have this wrong.) The in-order core executed these four stages in parallel. It could do this because of the preliminary work of the out-of-order processor. The out-of-order core might take 15 cycles to do all four steps, but the in-order core does it in one--in no small part due to that L0. The in-order core was being drafted by the out-of-order core to the point that it could manage a higher IPC than the out-of-order core--as long as the data was available, which it often was not, of course.
-
-
-
Friday 4th June 2021 18:59 GMT DS999
How do they know this is new?
They only found it after a lot of investigation ruled out other causes. It may have been true in years past but no one had enough CPUs running the same code in the same place for long enough that they could tease out the root cause.
So linking it to today's advanced processes may point us in the wrong direction, unless we can say for sure this wasn't happening with Pentium Pros and PA-RISC 8000s 25 years ago.
I assume they have sent some of the suspect CPUs to Intel, for them to take an electron microscope to the cores that exhibit the problem, so they can try to determine if it is some type of manufacturing variation, "wear", or something no one could prepare for, like a one in a septillion neutrino collision with a nucleus changing the electrical characteristics of a single transistor by just enough that an edge-condition error affecting that transistor becomes a one in a quadrillion chance.
If they did, and Intel figures out why those cores go bad, will it ever become public? Or will Google and Intel treat it as a "competitive advantage" over others?
-
Friday 4th June 2021 22:52 GMT SCP
Re: How do they know this is new?
I can't recall a case related to CPUs, but there were definitely cases like pattern sensitive RAM; resolved when the root cause was identified and design tools modified to avoid the issue.
The "good old days" were not as perfect as our rose tinted glasses might lead us to recall.
-
Tuesday 8th June 2021 00:04 GMT Claptrap314
Re: How do they know this is new?
We had a case of a power signal coupling a high bit in an address line leading out of the L1 in the 750. Stopped shipping product to Apple for a bit. Nasty, NASTY bug.
I don't recall exactly what the source of the manufacturing defect was on the Nintendo bug, but it only affected certain cells in the L2. Once you knew which ones to hit, it was easy to target them. Until I worked it out, though... Uggh.
-
-
Saturday 5th June 2021 06:51 GMT bazza
Re: How do they know this is new?
Silicon chips do indeed wear. There's a phenomenon I've heard termed "electron wind" which causes the atoms of the element used to dope the silicon (which is what makes a junction) to be moved across that junction. Eventually they disperse throughout the silicon and then there's no junction at all.
This is all related to current, temperature and time. More of any of these makes the wearing effect faster.
Combine the slowness with which that happens with the effects of noise, temperature and voltage margins on whether a junction is operating as desired, and I reckon you can get effects where apparently odd behaviour can be quasi-stable.
-
-
Friday 4th June 2021 19:33 GMT Anonymous Coward
what data got corrupted, exactly?
While the issue of rogue cores is certainly important, since they could possibly do bad things to useful stuff (e.g., healthcare records, first-response systems), I wonder if this will pop up later in a shareholders meeting about why click-throughs (or whatever they use to price advert space) are down. "Um, it was, ahh, data corruption, due to, ahm, processor cores, that's it."
-
Friday 4th June 2021 19:42 GMT Bitsminer
Reminds of silent disk corruption a few years ago
Google Peter Kelemen at CERN; he was part of a team that identified a high rate of disk data corruption amongst the thousands of rotating disk drives at CERN. This was back in 2007.
Among the root causes, the disks were a bit "mercurial" about writing data into the correct sector. Sometimes it got written to the wrong cylinder, track, and block. That kind of corruption, even on a RAID5 set, results in a write-loss of correct data that ultimately can invalidate a complete file system.
Reasoning is as follows: write a sector to the wrong place (data), and write the matching RAID-5 parity to the right place. Later, read the data back and get a RAID-5 parity error. To fix the error, rewrite the (valid, but old) data back in place because the parity is a mismatch. Meanwhile, the correct data at the wrong place lives on. When that gets read, the parity error is detected and the original (and valid) data is rewritten. The net net of this: loss of the written sector. If this is file system metadata, it can break the file system integrity.
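A toy model of that failure mode, assuming a simple XOR-parity stripe layout (illustrative only, nothing like a real RAID implementation):

from functools import reduce

def parity(blocks):
    return reduce(lambda a, b: a ^ b, blocks)   # RAID-5 style parity: XOR of the data blocks

stripes = [[1, 2, 3], [4, 5, 6]]                # two stripes of data blocks
parities = [parity(s) for s in stripes]         # their stored parity blocks

# Intended write: value 9 into stripe 0, block 0. The parity update lands
# in the right place...
parities[0] = parity([9, 2, 3])
# ...but the data write is mis-addressed and lands in stripe 1 instead.
stripes[1][0] = 9

for i, (s, p) in enumerate(zip(stripes, parities)):
    print("stripe", i, "parity", "ok" if parity(s) == p else "MISMATCH")
# Both stripes now show parity mismatches; "repairing" either one just
# reinstates stale or wrong data, so the written sector is effectively lost.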
-
Saturday 5th June 2021 06:59 GMT bazza
Re: Reminds of silent disk corruption a few years ago
Yes I remember that.
It was for reasons like this that Sun developed the ZFS file system. It has error checking and correction up and down the file system, designed to give probably error-free operation over exabyte filesystems.
Modern storage devices are close to the point where, if you read the whole device twice, you will not get the same bits returned both times: one will be wrong.
-
-
Friday 4th June 2021 21:36 GMT itzman
well its obvious
that as a data one gets represented by, e.g., fewer and fewer electrons, statistically the odd data one will fall below the threshold for being a one, and become a zero, and vice versa.
or a stray cosmic ray will flip a flop.
or enough electrons will tunnel their way to freedom....
-
Friday 4th June 2021 21:36 GMT rcxb
Higher datacenter temperatures contributing?
One has to wonder if the sauna-like temperatures Google and Facebook are increasingly running their datacenters at, is contributing to the increased rate of CPU-core glitches.
They may be monitoring CPU temperatures to ensure they don't exceed the spec sheet maximums, but any real-world device doesn't have a vertical cliff dropoff, and the more extreme the conditions it operates in, the sooner some kind of failure can be expected. The speedometer in my car goes significantly into the triple digits, but I wouldn't be shocked if driving it like a race car resulted in mechanical problems rather sooner in its life-cycle.
Similarly, high temperatures are frequently used to simulate years of ageing with various equipment.
-
Tuesday 8th June 2021 00:11 GMT Claptrap314
Re: Higher datacenter temperatures contributing?
I was never privileged to tour our datacenters, but I am HIGHLY confident that G is careful to run the chips in-spec. When you've got that many millions of processors in the barn, a 1% failure rate is E.X.P.E.N.S.I.V.E.
Now, for decades, that spec has been a curve and not a point. (IE: don't run over this speed at this temperature, or that speed at that temperature.) This means that "in spec" is a bit broader than the naive approach might guess.
They also have temperature monitors to trigger a shutdown if the temp spikes too high for too long. They test these monitors.
-
-
Friday 4th June 2021 21:51 GMT Anonymous Coward
Um. Clockspeed
Surprised no one has mentioned
As a core ages it will struggle to maintain high clocks and turbo speeds
For a core that was marginal when new but passed initial validation, it's not surprising to see it start to behave unpredictably as it gets older. You see it all the time in overclocked systems, but normally the CPU is running an OS so it'll BSOD on a driver before it starts to actually make serious mistakes in an app. For a lone core running compute in a server, it's not surprising that it'll start to misstep, which would align with their findings.
Identify mercurial cores and drop the clocks by 10%, not rocket science.
-
Saturday 5th June 2021 06:09 GMT Ken Moorhouse
Re: Um. Clockspeed. Surprised no one has mentioned
I have mentioned both Clock Skew and the interaction of tolerances ("Limits and Fits") between components/sub-assemblies in this topic earlier.
===
The problem with Voting systems is that the integrity of Clock systems has to be complete. The Clock for all elements of the Voting system has to be such that there is no chance that the results from one element are received outside of the clock period to ensure they are included in this "tick's" vote. If included in the next "tick's" vote then not only does it affect the result for this "tick", but the next "tick" too, which is susceptible to a deleterious cascade effect. I'm assuming that it is prudent to have three separate teams of developers, with no shared libraries, for a 2 in 3 voting system to eliminate the effect of common-mode design principles, which might fail to fault on errors.
If applying Voting systems to an asynchronous system, such as TCP/IP messaging (where out-of-band packet responses are integral to the design of the system), how do you set time-outs? If they are set too strict then you get the deleterious snowball effect, bringing down the whole system. Too slack and you might just as well use legacy technology.
-
-
Friday 4th June 2021 22:44 GMT They call me Mr Nick
Floating Point Fault
Many years ago while at University I heard of an interesting fault.
A physicist had run the same program a few days apart but had got different results. And had noticed.
Upon investigation it transpired that the floating point unit of the ICL mainframe was giving incorrect answers in the 4th decimal place.
This was duly fixed. But I wondered at the time how many interesting discoveries in physics were actually undiscovered hardware failures.
-
Saturday 5th June 2021 06:24 GMT Ken Moorhouse
Re: Floating Point Fault. had got different results
That makes it nondeterministic, which is subtly different to giving incorrect (yet consistent) answers in the 4th decimal place.
Maybe the RAM needed to be flushed each time the physicist's program was run. Perhaps the physicist was not explicitly initialising variables before use.
-
-
Sunday 6th June 2021 05:54 GMT Paddy
You get what you paid for.
Doctor, doctor, it hurts if I use this core!
Then use one of your millions of others?
But I need all that I buy!
Then buy safer, ASIL D, automotive spec chips.
But they cost more!
So you're cheap? NEXT!
Chip lifetime bathtub curves are statistical in nature. When you run that many CPUs, their occasional failures might be expected; and failures don't need to be reproducible.
-
-
-
Tuesday 8th June 2021 11:40 GMT yogidude
Re: An oldie but a goodie
The original bug in ENIAC was also not part of the design, but gave its name to what we now refer to generically as a flaw in the operation of the device/software, whatever the cause. That said, since posting the above I realised I omitted a line from the original (early 90s) joke.
I am Pentium of Borg.
Division is futile.
You will be approximated.
-
-
-
Sunday 6th June 2021 20:51 GMT Piro
CPU lockstep processing
Somewhere, there are some old greybeard engineers who developed systems like NonStop and are shaking their heads: finally, we've hit a scenario in which having multiple CPUs running in lockstep would solve the issue!
At least 3, so you can eliminate the "bad" core, generate an alarm, and keep on running.
But no, everyone thought that specialist systems with all kinds of error checking and redundancy were wild overkill, and that everything could be achieved with lots of commodity-level hardware.
-
Sunday 6th June 2021 23:10 GMT Anonymous Coward
I had a 40 year career diagnosing apparently random IT problems that people said "couldn't happen". Electrical noise; physical noise; static; "in spec" voltages that weren't tight enough in all cases; cosmic particles; the sun shining. English Electric had to ban salted peanuts from the vending machines because the circuit board production line workers liked them.
Murphy's Law always applies to any window of opportunity: "If anything can go wrong - it will go wrong". Plus the Sod's Law rider "..at the worst possible time".
Back in the 1960s my boss had a pragmatic approach about apparent mainframe bugs "Once is not a problem - twice and it becomes a problem".
-
Monday 7th June 2021 04:49 GMT daveyeager@gmail.com
Isn’t this just another one of those things IBM knew about decades ago? I’m pretty sure they built all these redundancies into their mainframes to catch exactly these types of very rare hardware malfunctions. And now the new kids on the block are like “look what we’ve discovered”.
Things I want to know: are they becoming more common because more cores mean higher odds that one will go haywire? Or is it because the latest manufacturing nodes are less reliable? Are per-transistor defect rates actually increasing? How much more prevalent are these errors compared to the past for the same number of cores or servers? Lots of unanswered questions in this article.
-
Monday 7th June 2021 12:16 GMT Anonymous Coward
L3 cache ECC errors
We've had a mystery case of some (brand new) Intel i7 NUCs suffering correctable and non-correctable level-3-cache ECC errors (giving a Machine Check Error) in the past 18 months.
Only happens when they're heavily stressed throwing (video) data around.
Seems to be hardware specific - get a "bad batch" which are problematic, showing up errors every few hours, or few 10's of hours.
Same software runs for 100's or thousands of hours on other hardware specimens with no issue.
Related? I don't know.
-
Tuesday 8th June 2021 00:23 GMT Claptrap314
Re: L3 cache ECC errors
This does sound similar. Seriously, it might be worth a sit-down to talk through, even though it's been 15 years since this was my job. I don't know if you are big enough to merit an account rep with Intel or not, but if you are, be sure to complain--those parts are defective, and need to be replaced by Intel. (And Intel _should_ be pretty interested in getting their hands on some failing examples.)
-
This post has been deleted by its author
-
Tuesday 8th June 2021 15:29 GMT Anonymous Coward
Re: L3 cache ECC errors
We tried talking to Intel but initially they gave us the runaround ("we don't support Linux" etc). (We probably buy a few hundred per year - but growing exponentially.)
As far as I know we're still effectively "screening" new NUCs by running our code for 48-72 hours, and any that fault in that time are put to one side and not sent to customers. Any that report faults (via our telemetry) in the field will be swapped out at the earliest opportunity.
Apparently 6 months or so after we started seeing the problem we did get some traction from Intel, sent them a NUC to analyse, and a few months later a BIOS update was issued - which fixed the problem on the NUCs we knew to be flaky at the time. Intel's BIOS release note cryptically says "Fixed issues with Machine Check Errors / Reboot while running game engine".
I understand we've seen some occasional MCEs since the new BIOS on some specimens, although they may have different root cause...
-
Tuesday 8th June 2021 16:51 GMT Claptrap314
Re: L3 cache ECC errors
Ouch, ouch, and OUCH!
If you are seeing a steady problem while ordering less than a thousand CPUs, then this is a HUGE deal. A 1/1000 escape should stop the line. Seriously, talk to your management about going to the press with this.
BECAUSE you are way, way too small for Intel to care about. And..Intel's parts are everywhere. This isn't going to just affect gamers & miners.
As to what you can do in the mean time, a couple of things come to mind immediately, sorry if you are already going there.
1) Double errors happen roughly at the square of the rate of single errors. Screen on corrected bits over a certain level, don't wait for the MCE.
2) Check that you are actually running the parts in-spec. I know this sounds insane, but if your workload drives the temperature of the core too high, then you are running it out of spec. Of course, manufacturers do significant work to predict what the highest workload-driven effect can be, but don't trust it. Also, read the specs on the temperature sensors on the part very carefully. Running the core at X temp is not the same as having the sensors report X temp.
3) It sounds like it would be worthwhile to spend some time reducing your burn-in run. (I come from the manufacturer's side of things, so the economics are WAY different than I'm used to, but still...)
a) Part of why I wanted you to check the temp spec so carefully is that if you are sure that you are in spec, you might be able to run at a slightly higher ambient temp & still be in spec. Why would you want to do this? Because the fails will happen faster if you do.
b) Try to identify which part(s) of your workload are triggering the fails, and just run that part over & over. I had test code that could trigger the 750 Medal of Honor bug after 8-10 hours. Eventually, I got it to fail in 1 second.
c) Try to see if there is a commonality to the memory locations that fail. As I mentioned with the Nintendo bug for the 750, it might be possible to target just a handful of cache lines to activate the failure.
-
-
-
-
Friday 8th October 2021 17:11 GMT BPontius
Mercurial, so basically predictable human behaviour. Imagine the problems that will crop up if/when quantum computers become of such complexity: computing errors within a qubit holding multiple values before the superposition collapses, and the possibility of quantum entanglement between qubits.