> "The other half is a mix of false accusations and limited reproducibility."
Perfect for AI facial recognition workloads in Apple stores then.
Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design …
...for the cloud.
Before anyone posts the obvious rebuttal: note this phrase "two of the world's larger CPU stressors, Google and Facebook".
If your critical business processes are on-prem, the chances are that you will not be stressing your CPUs to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time.
I grew up with the concept of "the clock pulse". If we're pushing synchronous data to the limits (rise-fall time of data wrt clock pulses) then you could arguably get a skew effect. If designers are in denial about that then there are big problems ahead. (Rowhammer is a related problem).
To me this sounds like quantum effects. No manufacturing process produces exact replicas; there is going to be subtle variation between chips. I don't know anything about modern chip design and manufacture so can't speculate what it could be. But electron behaviour is just the law of averages. And so whatever these defects are, it means electrons can periodically jump where they shouldn't. The smaller the currents, the fewer party* electrons are needed for this to become significant.
* The party number is one of the important quantum numbers. It determines how likely an electron is to be an outlier. It's normally represented as a mullet.
But these are process variations that are being missed by manufacturers and where the chip generally functions as required. Just every once in a while it goes haywire. You could call it fate. You could call it luck. You could call it Karma. You could say it's mercurial or capricious. Or you could suspect some process variation allows tunnelling with low probability, or that some other odd transition or excitation is happening.
It's just down to the statistics of very rare events with very large N. If you have a reliable processor with a clock speed of 10^9 hertz that gives you just one error every 10^20 clocks, then you can expect an error every 3,000 years or so: a one-in-five-hundred to one-in-a-thousand chance of seeing a single error during the 3-6 year life of the system. I can live with that for my laptop.
But if you buy a million of those processors and run them in parallel in data centres then you will see roughly an error every day.
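For anyone who wants to check the arithmetic, here's the back-of-the-envelope sum in Python (the figures are the hypothetical ones above, not measured rates):

clock_hz = 1e9                      # 10^9 clocks per second
error_rate = 1e-20                  # one error per 10^20 clocks (hypothetical)
seconds_per_year = 3600 * 24 * 365

# One CPU: mean time between errors, in years (roughly 3,000)
mtbf_years = 1 / (clock_hz * error_rate) / seconds_per_year
print(f"one CPU: an error roughly every {mtbf_years:,.0f} years")

# A fleet of a million such CPUs: expected errors per day (roughly one)
fleet = 1_000_000
errors_per_day = fleet * clock_hz * error_rate * 3600 * 24
print(f"{fleet:,} CPUs: about {errors_per_day:.1f} errors per day")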
The trouble is that those errors aren't evenly spread. Specific individual cores go bad. The chances are against you having one of those in your laptop or one of your on-premises servers, but if you do have one then you may experience a series of mysterious crashes, incorrect calculations and/or data loss, not just one incident.
"If your critical business processes are on-prem, the chances are that you will not be stressing your CPU's to "mercurial" levels. But if your accounts data (for instance) is in the cloud, chances are CPU time in crunching it is being shared with other companies' CPU time."
I don't think it's anything to do with CPU time, but simply the number of CPUs. As the article notes, it's a few problematic cores per several thousand CPUs, i.e. it's not random failures due to the large amount of use; it's some specific cores that have a problem. But since the problems are rare, only people operating many thousands of them are likely to actually encounter them. So it's a bit misleading to call them "stressors" of CPUs; it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs.
So it's hard to say if on-prem would be better or not. On the one hand, you're unlikely to have enough CPUs to actually have a problem. But if you get unlucky and you do, the problematic core will be a greater percentage of your computing, and you're unlikely to be able to actually spot it at all. On the other hand, being assigned different CPUs every time you run a task in the cloud makes it almost inevitable that you'll encounter a troublesome core at some point. But it's unlikely to be a persistent problem since you won't have the same core next time, and the companies operating at that scale are able to assign the resources to actually find the problem.
No, because of the way crypto is designed. Any miner who tries to submit a mined block will have it tested by every other node on the network. If the miner's system glitched, then the block just won't be accepted. And this sounds rare enough that a miner would just shrug and move on to the next block.
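Roughly what that re-checking looks like, as a toy sketch (made-up header format and an easy difficulty target, purely to show that every node redoes the hash itself and never trusts the miner's arithmetic):

import hashlib

TARGET = 2 ** 240                   # toy difficulty: top 16 bits of the hash must be zero

def block_hash(header: bytes, nonce: int) -> int:
    return int.from_bytes(hashlib.sha256(header + nonce.to_bytes(8, "big")).digest(), "big")

def verify(header: bytes, nonce: int) -> bool:
    # Every node recomputes the hash itself before accepting the block.
    return block_hash(header, nonce) < TARGET

header = b"prev-hash|merkle-root|timestamp"   # stand-in for a real block header
nonce = next(n for n in range(10**7) if verify(header, n))
assert verify(header, nonce)
# A faulty core that "finds" a nonce which doesn't really hash below the target
# fails this check on every other node, so the bad block is simply ignored.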
"Stream ciphers", one of the common kinds of encryption algorithm, work by taking a key and generating a long string of pseudo-random numbers from that key. That then gets XOR'd into the data.
It's the same algorithm to encrypt and to decrypt. (Like how ROT13 is the same algorithm to encrypt and to decrypt, except a lot more secure).
So it's certainly possible that a core bug results in the specific sequence of instructions in the pseudo-random-number generator part giving the wrong answer. And it's certainly possible that this is reproducible: repeating it with the same key gives the same wrong answer each time.
That would lead to the described behaviour - encrypting on the buggy core gives a different encryption from any other core, so only the buggy core can decrypt it.
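A toy sketch of that failure mode, with Python's random.Random standing in for a proper keystream generator and a hypothetical "buggy" flag standing in for the faulty core (illustration only, nothing here is real crypto):

import random

def keystream(key: int, length: int, buggy: bool = False) -> bytes:
    rng = random.Random(key)                      # stand-in PRNG, not a real cipher
    ks = bytearray(rng.randrange(256) for _ in range(length))
    if buggy:
        ks[7] ^= 0x40                             # pretend the faulty core deterministically miscomputes one byte
    return bytes(ks)

def xor_cipher(data: bytes, key: int, buggy: bool = False) -> bytes:
    # Encryption and decryption are literally the same XOR operation.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data), buggy)))

msg = b"pay 100 pounds to Alice"
ct = xor_cipher(msg, key=42, buggy=True)           # encrypted on the mercurial core
assert xor_cipher(ct, key=42, buggy=True) == msg   # the same buggy core decrypts it fine
assert xor_cipher(ct, key=42, buggy=False) != msg  # every healthy core gets garbage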
Maybe they need to use an encryption algorithm that isn't susceptible to (virtually) identical math errors during encryption and decryption. Then you could self-check by decrypting the encrypted output and comparing it to the original. So long as the errors produce un-decryptable results, you should be fine.
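That's the catch, carrying on from the toy xor_cipher above: with a symmetric stream cipher a round-trip self-check on the same faulty core passes, because the identical error cancels out; you only catch it by re-checking on a different core (or a different implementation):

ct = xor_cipher(msg, key=42, buggy=True)
assert xor_cipher(ct, key=42, buggy=True) == msg    # self-check on the bad core: passes, bug stays hidden
assert xor_cipher(ct, key=42, buggy=False) != msg   # cross-check on a healthy core: mismatch, bug caught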
it's not about how much stress any particular CPU encounters, but rather about companies that happen to use a lot of CPUs
Well, it's also about how much of the time a given CPU (or rather each of its cores) is being used, since that's what gives you a result that might be incorrect. If a company "uses" a million cores but a given core is idle 90% of the time, they'll be much less likely to encounter a fault, obviously.
So while "stressing" is probably not really an accurate term – it's not like they're using the CPUs outside their documented envelope (AFAIK) – "using more or less constantly" is a relevant qualification.
The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests.
If you consider chips as not too dissimilar from the networking of smarter humans, emerging anomalies are much easier to understand and be prepared for and accepted as just being an inherent endemic glitch always testing novel processes and processing there is no prior programming for.
And what if they are not simply errors but other possibilities available in other realities/times/spaces/virtually augmented places?
Are we struggling to make machines more like humans when we should be making humans more like machines….. IntelAIgent and CyberIntelAIgent Virtualised Machines?
Prime Digitization offers Realisable Benefits.
What is a computer other than a machine which we try to make think like us and/or for us? And what other model, to mimic/mirror could we possibly use, other than our own brain or something else SMARTR imagined?
And if through Deeper Thought, our Brain makes a Quantum Leap into another Human Understanding such as delivers Enlightened Views, does that mean that we can be and/or are Quantum Computers?
And is that likely to be a Feared and/or AWEsome Alien Territory?
......some of us (dimly) remember the idea of a standard development process:
1. Requirements (how quaint!!!)
2. Code
3. Unit Test
4. Functional Test
5. Volume Test (also rather quaint!!)
6. User Acceptance Test (you know...against item#1)
.....where #4, #5 and #6 might overlap somewhat in the timeline.
Another old fashioned idea was to have two (or three) separate installations (DEV, USER, PROD).......
......not sure how any of this old fashioned, twentieth century thinking fits in with "agile", "devops", "cloud"....and other "advanced" twenty first century thinking.
......but this article certainly makes this AC quite nostalgic for days past!
In the Elder Days, when things was Less Rushed, sure, you could take your time with a product, and deliver a product that lived up to its promises.
Nowadays in these Younger Days everything is rushed to market (RTM) after a vigorous spit 'n polish and sugarcoating session to hide most of Them Nasteh Buggreh Bugs. And nary a peep of said TNBB's either... hoping said TNBB's won't manifest themselves until closer to the End of Life of the Product.
Case in point - MCAS.
ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.
I never saw any problems with an 8080, 8085, 8048, or Z80 that I didn't create myself and fix as soon as I saw the problem. Processors used to be completely reliable until the marketing and sales departments started wanting to add "features", which has led to all of today's issues.
ZX80/81/Speccy users must be chortling with glee as their ancient Z80 CPUs still produce valid results and nary a corruption.
On the other hand, back when, a friend of mine remarked that the Sinclair Scientific calculator was remarkably egalitarian, because if you didn't like the answer it gave you, you just had to squeeze the sides and it would give you a different one.
AC has described the ideal case.
In practice, there were repeats of item 1 between items 2 and 3, 3 and 4, etc. Table-thumping customer managements and toadying contractor sales people.
(S)He also omits a necessary step between 1 and 2, namely the software design. The requirements stated what was thought to be required - not always a correct piece of analysis. The software design says how you get there in terms of data structures and algorithms. Once software got past transcribing maths into FORTRAN, the SD was essential.
For CPUs, replace software with microcode. This was even more problematical than orthodox code.
7. Use in production.
It's only in 7, and even then only at large scale, that rare, sporadic failures become recognisable. Even if you were lucky enough to catch one at the previous stages you wouldn't be able to reproduce it reliably enough to understand it.
Is replacing the CPU with another identical one not a good solution, or will the new one start misbehaving in the same way?
The article states that Google and Facebook report a few cores in a thousand. That means that most CPUs are functioning just fine, so rip out the mercurial CPUs and replace them. That should give a chance of solving the immediate issue.
Of course, then you take the misbehaving CPU and give it a good spanking, euh, put it in a test rig to find out just how it fails.
The question is whether this is at the CPU level, the board level, the box level, or the system level. Tolerances* for all of these things give rise to unacceptable possibilities - don't forget at the board/box level you've got power supplies and, hopefully, UPSes attached to those. How highly do these data centres/centers rate these seemingly mundane sub-assemblies, for example? (I'm sure many of us here have had experiences with slightly wayward PSUs).
*The old-fashioned "limits and fits" is to my mind a better illustration of how components work with each other.
Can we have ECC RAM supported by regular chipsets, please, like we certainly had off the shelf in the late '90s / early 2000s. The sheer quantity of RAM and reduced tolerance to radiation mean the probability of bit flips is rather greater today than before.
Either AMD or Intel could put support back into consumer chipsets as an easy way to get an edge over competitors.
Regarding CPUs, there's a reason satellite manufacturers are happy using a 20-year-old architecture and manufacturing process at 200nm: lower vulnerability to radiation-induced errors. (And using SRAM rather than DRAM, for the same reason.) Performance, cost, "tolerable" error. Rather less practical to roll back consumer performance (unless you fancy getting some genuinely efficient software out in circulation).
I have worked for a few hardware companies over the years and every single one has at some point had issues with random errors causing system crashes at above designed rates - these were all bit-flip errors.
In each case the people who noticed first were our biggest customers. In one of these cases the way they discovered the problem was products from two different companies exhibiting random errors. A quick look at both motherboards showed the same I/O chipset in use. Radioactive contamination in the chip packaging was the root cause.
You can mitigate these by putting multi-layer parity and ECC on every chip, bus and register with end-to-end checksumming. That will turn silent data corruption into non-silent, but it's also really expensive.
But at least let's have ECC as standard!
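For anyone who hasn't met ECC up close, here's a minimal single-error-correcting Hamming(7,4) sketch in Python. Real ECC DIMMs use a wider SECDED code over 64 data + 8 check bits, but the principle is the same: the flip gets detected and corrected instead of silently propagating.

def hamming74_encode(d1, d2, d3, d4):
    # Code word layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    p1, p2, d1, p3, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 = clean, otherwise the 1-based position of the flipped bit
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1           # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

word = hamming74_encode(1, 0, 1, 1)
word[4] ^= 1                           # simulate a cosmic-ray / alpha-particle bit flip at position 5
data, pos = hamming74_decode(word)
assert data == [1, 0, 1, 1]            # data intact, and the event is visible (pos == 5) instead of silent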
I forget the exact details - this was over 10 years ago - but IIRC systems that had generated these errors were put in a radiation test chamber and the radioactivity measured. Once you have demonstrated there's a problem then it's down to the chipset manufacturer to find the issue. I think it was just low-level contamination in the packaging material that occasionally popped out an alpha particle and could flip a bit.
The remediation was a massive PITA. I think we were dealing with it for about 2 years, from initial high failure rates to having all the faulty systems replaced.
Over the years I have spent far more of my career dealing with these issues than I would like. I put in a big shift remediating Seagate MOOSE drives that had silent data corruption as well.