no mention of bitcoin mining
Wonder how that would get on.
Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists. The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nvidia's Volta technology. …
reinstate the bite-test
Which, with gold coins, was mostly to test whether they were adulterated with lead. Some unscrupulous moneylenders or mints would use lead to make the expensive stuff go further, thus creating more profit.
Just as well our financial institutions today are not that unethical eh?
For mining (not Bitcoin, as that's ASIC-only now), but any altcoins:
The Titan gets around 77 MH/s on ETH at a cost of £3k while drawing over 230W.
A Vega 56 or 64 gets 48 MH/s on ETH at a cost of £500 (but really £600-900 now) while drawing only 100W (you do need to do some serious optimisation on both to get these figures).
If it's throwing memory errors then it will also cock up the mining. Memory errors result in incorrect shares, which are not paid for at all.
Very surprised at how low the HBM memory bandwidth is: 8GB of HBM2 on a Vega will overclock to over 600GB/s, while this Titan has 3x the bus but only 652GB/s, which suggests they are not using the vastly superior Samsung chips but the Hynix ones, which are pants.
No mention because no self-respecting miner would use a Titan to mine any cryptocurrencies. (Let's ignore the fact Bitcoin hasn't been mineable on GPUs for about 5 years.)
We don't just need speed, we need efficiency. Titans are not efficient in price or power usage.
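A quick back-of-envelope sketch in Python using the hashrate, power and price figures quoted a few comments up (the £600 Vega price is an assumption within the range stated there) makes the efficiency gap plain:

# Rough efficiency comparison using the figures quoted in this thread
# (hashrates, power draws and prices are the commenters' numbers, not measured).
cards = {
    "Titan V":    {"hashrate_mhs": 77, "power_w": 230, "price_gbp": 3000},
    "Vega 56/64": {"hashrate_mhs": 48, "power_w": 100, "price_gbp": 600},
}

for name, c in cards.items():
    per_watt = c["hashrate_mhs"] / c["power_w"]
    per_kpound = c["hashrate_mhs"] / c["price_gbp"] * 1000
    print(f"{name}: {per_watt:.2f} MH/s per watt, {per_kpound:.0f} MH/s per £1000")

On those numbers the Vega delivers roughly half as much again per watt and three times as much per pound spent.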
I guess we shouldn't be surprised. Certainly, I don't know if this is an architectural issue, but speed at all costs is what is being pushed out and it's what sells, although if there was a real cost to manufacturers maybe they'd reel it in a little. If it's good enough for gaming use but not scientific use it should be labelled as such. It is aimed at industry, so needs to perform better than it does.
from the website:
https://www.nvidia.com/en-us/titan/titan-v/
NVIDIA TITAN V is the most powerful graphics card ever created for the PC, driven by the world’s most advanced architecture—NVIDIA Volta. NVIDIA’s supercomputing GPU architecture is now here for your PC, and fueling breakthroughs in every industry.
I'm not. I recall seeing a seminar or conference presentation shortly after people began touting GPUs for scientific computation, where the presenter pointed out that they were optimized for graphics output at speed ... a domain where getting a bit or two wrong every now and then would only show up as a probably small and very brief visual glitch.
Still, perhaps in the intervening decade or whatever, this became less likely, and so now the problem is re-emerging?
Indeed, it used to be that GPUs were completely unreliable for precise computations. Of course, that has changed in the past decades, when the industry realized that there was money in fast GPUs that did not make mistakes, and advertised them as such.
There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.
nVidia market this one as being for compute, and actively discourage using their other - much cheaper - cards for this kind of thing.
To put it bluntly, this is a massive blunder on the part of nVidia that's going to damage their reputation for a decade or more.
There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.
This. Casual users might not care. But for visual artists, this will act like a bug and bug the hell out of them. It'll be even worse when a day's worth of rendering has one single visual imprecision that forces the artists to constantly redraw their art assets.
People have been trying to shoehorn NVIDIA gaming GPUs into scientific applications ever since the Tesla GPUs were first introduced. NVIDIA's marketing has always been ambiguous regarding the differences between desktop and enterprise kit, and they leave it up to the systems integrators to tell customers whether or not the GPU would be a good fit for their applications. They love to tout the "supercomputing" architecture of the gaming GPUs as a way to move more units, but they're not designed to replace their Tesla line. They're produced from the same binning practices that distinguish desktop CPUs from their enterprise counterparts; a Tesla GPU that doesn't meet the performance standards to be enterprise-worthy gets remade into a desktop GPU with a bunch of features disabled.
The fact is that gaming GPUs can be used in various scientific applications, but they've never promised accurate results, because they don't have double precision. If you don't know why you would need double precision, then you probably shouldn't be in the market for a supercomputer.
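A minimal illustration, in Python with NumPy, of why double precision matters (purely to show the scale of the gap, nothing specific to these cards):

import numpy as np

# Single precision carries roughly 7 significant decimal digits, double
# precision roughly 16, so a small contribution survives in one and is
# silently lost in the other.
print(np.float32(1.0e8) + np.float32(1.0) == np.float32(1.0e8))  # True: the 1.0 is absorbed entirely
print(np.float64(1.0e8) + np.float64(1.0) == np.float64(1.0e8))  # False: double precision keeps it

Accumulated over millions of operations in a long-running simulation, that lost information is the difference between a usable result and noise.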
when rounding to 1 significant figure.
Actually, I think this is exactly what's happening. People here use Nvidia cards to do scientific computing, and recently they have reported that the same calculations give different results. They attribute this to the massively parallel calculations that spread the sub-calculations over the many cores differently each time, such that intermediate results are calculated along different paths, and since these are floating-point calculations the roundings in these intermediate results differ.
They say that, unlike on a real processor, there is no OS, so there is no way to tell the processor what to do, when and how: it does its magic on its own.
I spent 10 years doing microprocessor validation, from 1996-2006.
1) There is an approximately 0% chance that this is due to pushing memory to the edge of the envelope. All doped silicon degrades with use. If they push the envelope, then all of their cards will die sooner rather than later. The closest you get to this is what we call "infant mortality", where a certain percentage of product varies from the baseline to the point that it dies quickly.
2) In order to root cause errors of this sort, it is really, really important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it.
3) Design bug or manufacturing defect, inconsistent results come down to timing issues in every case I saw or heard about. In the worst case, you get some weird data/clock line coupling that causes a bit to arrive late at the latch. Much more often there is some corner case that the design team missed. Again, I would need to know the nature of the computations involved, and the differences observed, to meaningfully guess at the problem.
"important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it."
The article said that one person tested 4 GPUs and found problems with 2 of them. Given the small sample size, and only being tested on a single problem, I don't think there's really enough information to figure out which it might be.
I'll tell you about the card behaviour...
At the moment mine is mining Ethereum. I need to offset the cost until the newest generation of CUDA gets proper support and the drivers mature.
I used to run the card overclocked with 120% power and a +142 memory overclock (the best stable overclock at that time). It worked for a bit more than a month with no issues. I talked to other people online who couldn't get theirs to run as stable as mine did. I was quite happy that I had won the silicon lottery, until it stopped working at these settings.
After a month and a bit, the card is stable at 100% power and a +130 memory overclock. If I try to overclock it a bit higher it stops working properly after a few hours: ethminer shows the calculations happening but never gets as far as sending the results back to the pool.
It seems to me that it has something to do with the memory as well.
The card has degraded in performance with time, and as far as Ethereum mining goes it is related to the memory. Could you please explain how this is possible?
This is possible because semiconductor structures on silicon can wear out due to metal migration driven by current density and temperature. This causes their switching characteristics to change or degrade over time. Device Mean-Time-To-Failure (MTTF) is typically characterized using the Arrhenius equation, where higher device temperature results in shorter life.
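For reference, the usual form of this relationship for electromigration is Black's equation with an Arrhenius temperature term (the symbols below are the standard ones, not figures for this particular part):

% Black's equation for electromigration MTTF with the Arrhenius temperature term:
%   A   - a constant for the process and interconnect geometry
%   J   - current density, n - an empirically fitted exponent (often about 2)
%   E_a - activation energy, k - Boltzmann's constant, T - absolute temperature
\mathrm{MTTF} = A \, J^{-n} \exp\!\left(\frac{E_a}{k T}\right)

Higher current density shrinks the J^{-n} factor and higher temperature shrinks the exponential, so expected lifetime drops quickly under sustained overclocking and overvolting.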
In a CMOS transistor structure, most power is dissipated when switching logic states. Power translates into heat. As operating frequency is increased, the transistors switch more often in less time causing more heat, which will accelerate wearout.
It used to be that the characteristic life of a device could be 100 years or more, but with operating frequencies now at GHz levels and device feature sizes shrunk to pack more transistors into less real estate, the design margins have shrunk to the point where characteristic lifetimes are reduced to a decade or two, and greatly shortened by overclocking.
Yes, I am a reliability engineer.
In the real world there aren't exact answers, and very few questions involve integers.
If the errors are small and/or rare enough - that's fine.
Our experiments give bad results all the time - we understand that, it's the whole point of experimental physics.
It completely depends on the use case.
Galaxy dynamics and Knights of the Old Republic - fine.
Rocket dynamics - not fine.
From Floating-Point Arithmetic Besieged by “Business Decisions” - A Keynote Address, prepared for the IEEE-sponsored ARITH 17 Symposium on Computer Arithmetic, delivered on Mon. 27 June 2005 in Hyannis, Massachusetts
You have succeeded too well in building Binary Floating-Point Hardware.
Floating-point computation is now so ubiquitous, so fast, and so cheap that almost none of it is worth debugging if it is wrong, if anybody notices.
By far the overwhelming majority of floating-point computations occur in entertainment and games.
IBM’s Cell Architecture: “The floating-point operation is presently geared for throughput of media and 3D objects. That means ... that IEEE correctness is sacrificed for speed and simplicity. ... A small display glitch in one display frame is tolerable; ...” Kevin Krewell’s Microprocessor Report for Feb. 14 2005.
A larger glitch might turn into a feature propagated through a Blog thus: “There is no need to find and sacrifice a virgin to the Gorgon who guards the gate to level 17. She will go catatonic if offered exactly $13.785.”
How often does a harmful loss of accuracy to roundoff go undiagnosed?
Nobody knows. Nobody keeps score.
And when numerical anomalies are noticed they are routinely misdiagnosed.
Re EXCEL, see David Einstein’s column on p. E2 of the San Francisco Chronicle for 16 and 30 May 2005.
Consider MATLAB, used daily by hundreds of thousands. How often do any of them notice roundoff-induced anomalies? Not often. Bugs can persist for Decades.
e.g., log2(...) has lost as many as 48 of its 53 sig. bits at some arguments since 1994.
PC MATLAB’s acos(...) and acosh(...) lost about half their 53 sig. bits at some arguments for several years.
MATLAB’s subspace(X, Y) still loses half its sig bits at some arguments, as it has been doing since 1988.
Nevertheless, as such systems go, MATLAB is among the best.
Rocket dynamics - not fine.
Depends on the trade off.
Would you like to use a mainframe with duplicate processors checking each result - but only enough power to model your combustion chamber with 1m^3 cells?
Or a bunch of cheap GPUs where you can do 1mm^3 but some small percentage of those cells have incorrect values?
Classic report I once got back from our supercomputer center: "The Cray detected an uncorrected memory fault during your run. Please check your results." If I could check the results manually, why would I need the fscking Cray?
Among other products, I worked on the STI Cell microprocessor. It was my job to compare the documents to the actual product. The documents plainly stated the accuracy of the division-class instructions. If MATLAB or whoever failed to write code that incorporated the information in the documentation, whose fault is that? That MATLAB had been so egregiously wrong for a decade before the STI Cell microprocessor came out should help clear things up if anyone is confused.
IEEE-754 is fine for specifying data representation and last-bit correctness. But the nasty corners are an excellent example of just how bad committee work can be. And the folks writing floating point libraries regularly produce code that is mediocre at best. Don't blame hardware for your software bugs.
Or even the similar floating-point problem with the DEC VAX (8600/Venus family, I think) circa 1988? Recollections of my then employer's chemical engineering department having to re-run significant amounts of safety-critical design-related computations over a period of weeks and months on alternative (slower) VAX models. In the support teams we ended up scheduling low-priority batch jobs running tasks with known results, set up to flag re-occurrences of the problem, as it wasn't failing consistently.
Some versions of PowerPC’s AltiVec SIMD unit are optimised to complete instructions in a single clock cycle at the expense of perfect numerical accuracy. Well documented, understood and repeatable, this was fine for games, signal processing, image processing.
This problem with NVidia’s latest sounds different. Sounds like a big mistake in the silicon process, or too optimistic on the clock speed.
"Glitches on Titan V
Wish Asimov (or Heinlein, or Pournelle) were here to write it."
How close is "Missed cache on Ganymede"?
https://en.wikipedia.org/wiki/Christmas_on_Ganymede : aboriginal workers on a Jovian satellite named after a pretty boy employed as Zeus's "cup-bearer" (!) demand a visit from Santa Claus on a flying sled bringing presents. Corporate colonial Earthmen manage, with great effort, to provide this performance and avoid or conclude strike action. Earthmen then realise that an annual visit from Santa Claus means once every orbit of Ganymede around Jupiter, or once a week.
Quite a miscalculation, there.
I would say that in reality the "Ossie" strikers (look like ostriches) would (not should, but would) be taken up on the sled and dropped from considerable height, but I'm not sure that that would matter. They might even fly down.
Wikipedia reports another miscalculation of sorts: Asimov, aged about 20.9, wrote and offered the short story in December 1940, then became aware that a Christmas story needed to be sold by July to appear by Christmas. In fact he sold it the following June. Then of course in the 1990s Robert Silverberg expanded it to novel length. :-) Not true, but probably he still has time this year, it is only March......
I'm thinking this "bug" is intentional. Since it only applies to scientific calculations, and exists on the "gaming GPU", I'm thinking this is Nvidia's way of getting people to stop using the cheaper GPUs to build HPCs.
Didn't they ask nicely for folks to stop using the gaming GPUs for HPC, then change their T&Cs to prevent such action? I'm almost 100% positive that they did, which makes this rather suspect in my eyes.
"Didn't they ask nicely for folks to stop using the gaming GPUs for HPC, then change their T&Cs to prevent such action"
I'm not sure nicely went anywhere near it, it's a straight up money grab. It would be like if you could buy a PC for gaming for current prices, but if it's used in any way to make money, then it costs 15 times as much, plus an annual licence fee equal to 5 times the cost.
The 16-bit cards are licensed annually IIRC, as well as costing about as much as a small car. If 8-bit accuracy will do you, then you can get 3-4 cards and the computer to go with them for about the same as the dev licence fee to go with that 16-bit card.
The software licensing change was to say you can't use CUDA libraries in a "datacentre" without paying their annual dev licensing fee.
At no point have they said, nor should they be able to say (barring some serious revision of end-user rights), that you can't use a particular GPU for a particular purpose. You just can't use their optimized libraries.
I'm not sure it's *quite* worked as intended, as in academic and research circles it just seems to have validated the argument for using open source. There had been some debate on whether you use OpenCL or CUDA, which resulted in no ATI cards being bought. But now you either find another 20-30 grand in the budget, or you use OpenCL.
Using float instead of decimal as a datatype for things such as financial calculations in C-type languages is a common mistake that results in these sorts of errors (floating-point maths is not 100% decimally accurate). If I had a pound for every time I'd seen it occur in a business system I'd be a good £7.01 better off.
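A minimal Python illustration of the kind of surprise meant here (the amounts are made up):

from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly, so a
# "financial" sum quietly picks up an error in the last place.
payments = [0.10, 0.20, 0.30]            # pounds stored as binary floats
print(sum(payments))                     # 0.6000000000000001
print(sum(payments) == 0.60)             # False

exact = [Decimal("0.10"), Decimal("0.20"), Decimal("0.30")]
print(sum(exact) == Decimal("0.60"))     # True: decimal arithmetic is exact here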
Not necessarily. Floating point arithmetic is not commutative - (a+b)+c =/= a+(b+c) [https://en.wikipedia.org/wiki/Associative_property]. If you are running on a parallel system you can't be certain which order operations will be done in - it will depend on how the system has scheduled your calculation out to the processing units. I've had this happen when running statistical models using OpenMP on a single CPU, on a GPU it wouldn't surprise me at all.
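A minimal Python sketch of the effect described, where the result depends on the grouping the scheduler happens to produce:

import random

# Floating-point addition: the grouping of operations changes the result,
# so two runs that combine partial sums in different orders can disagree.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0: the 1.0 is absorbed into -1e16 before the cancellation

xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
print(sum(xs) == sum(reversed(xs)))   # typically False: summation order matters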
Agree, just
s/commutative/associative/
floating point is commutative.
Anyway, I assume the scientific algorithm is designed to be numerically stable, accounting for parallel execution (so associativity errors should not accumulate this way).
=> something more than just the algorithm to blame
Aside from Horridbloke's point about repeatability, probably not. There are many use cases for floating-point datatypes as long as the accuracy level is understood and operated within. For example, a certain floating-point type may be accurate to 15 decimal places and therefore calculations should be accurate to the same. Although the reality is that if you want accuracy to 15 decimal places then you need to work with accuracy at least an order of magnitude or two beyond this. The same is true for financial calculations: if you want accuracy to two decimal places then either store and process everything with accuracy to three or four decimal places, or perform repeatable rounding on sub-totals and use those rounded values for grand totals, not separate calculations.
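A minimal Python sketch of that "round the sub-totals once and reuse them" approach (the figures and the half-up rounding rule are just for illustration):

from decimal import Decimal, ROUND_HALF_UP

def round_2dp(x: Decimal) -> Decimal:
    # One agreed rounding rule, applied in exactly one place.
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

raw_lines = [Decimal("10.005"), Decimal("10.005"), Decimal("10.005")]

subtotals = [round_2dp(x) for x in raw_lines]   # 10.01 each
grand_total = round_2dp(sum(subtotals))         # 30.03: built from the rounded figures
recomputed = round_2dp(sum(raw_lines))          # 30.02: rounds the raw sum instead

print(grand_total, recomputed)                  # they disagree by a penny

The point is consistency: every report that reuses the already-rounded sub-totals agrees to the penny, whereas re-deriving totals from the raw values does not.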
"At the quantum level, there is uncertainty as to position, or even the outcome. It seems these cards are modelling that behavior."
Point is, they're not supposed to be doing that. Even if the algorithms generate deterministic chaotic outputs, that wouldn't explain two cards being OK.
Seriously, this is not how it's done. You are actually trying to calculate the Schrodinger wave function, not the measurement values. In fact you are calculating probabilities, not actual values. So the only source of uncertainty in the calculations should be the approximation in the calculation methods (for example calculating partial sums where the real value is an infinite sum).
"Engineers speaking to The Register on condition of anonymity to avoid repercussions from Nvidia said the best solution to these problems is to avoid using Titan V altogether until a software patch has been released to address the mathematical oddities." Wow. This was the most interesting part of the article. What sort of "repercussions" do the engineers expect Nvidia to throw at them? What has Nvidia done in the past to engineers who reveal these types of flaws?
GPUs good enough for gaming but not so great at Bitcoin and other non-gaming applications will bring the price of high-end gaming graphics cards back down to something more reasonable.
The bit miners can create a niche market for dedicated "BmPU" (Bitmining Processor Units) that I'm sure someone will seek to corner.
Serious Bitcoin miners, and miners of many other cryptocurrencies, have switched to specialised hardware generically known as ASICs - silicon which can only do one thing (e.g. run the Bitcoin algorithm). The development cost of such dedicated silicon is high, so only bigger miners use it. Having only a few large players in the game runs contrary to the point of cryptocurrencies, i.e. the work on the ledger should be distributed amongst millions of users so that no one entity can influence over 50% of the work. Apparently having fewer big players rather than many small players leads to fluctuations, too.
Ethereum - perhaps the largest cryptocurrency after Bitcoin - uses an algorithm that is deliberately suited to GPUs over current ASIC designs because it requires a lot of memory. In addition, Ethereum keeps threatening to switch from Proof of Work to Proof of Stake in an effort to dissuade anyone from making the investment in an Ethereum ASIC.
Ironically, it was the idea that millions of people already possessed powerful CPUs and GPUs for non-mining purposes - thus combined they had more power than any Bad Actor might hope to acquire - that was central to Bitcoin.
...All in all, it is bad news for boffins as reproducibility is essential to scientific research....
Er. NO. Grants are essential to scientific research. To get grants you need to have published papers. To get papers published you need to have spectacular findings, supported by amazing data, And if your raw data is not amazing enough - just change it.
I recommend this site: https://retractionwatch.com/
What you are referring to is grant farming. Not to be confused with real scientific research (easy to confuse the two as the white coats mix together like zebras on the Serengeti) . Real scientific research (tm) is low key and frequently rubs the consensus up the wrong way and has a tough time in peer review whilst grant farming only ever works within the consensus paradigm and shoots through pal review with spelling mistakes intact.
An error like the one described could wreak havoc with exact calculations, but I wonder if it would make much difference in the machine learning models for which these cards are so widely used. In the early days of neural networks, "graceful degradation" was said to show their similarity to our brains, where the loss of a neuron or two has a negligible effect. A systematic error might throw off calculations of connection weights, but random errors might well have a comparably minor effect.
Actually I expected something like this.
The problem with effects like Rowhammer is that the actual physics comes back to bite you.
Fix one problem and it just causes another one, such as in the rarely documented cases of ASLR breaking down with hash collisions where the data being processed matches up by chance with the ASLR algorithm used.
I found that DDR4 and 5 still get problems related to heat, which can cause all sorts of instability, and it is well known that for high-reliability applications you want to run the chips at far less than the maximum clock rate to give those trillions of capacitors a fair chance for refresh to work properly.
Over, and over again: floating point does not exist in the part of the real world that computers actually interact with. I've built an (Intel) IEEE-754 emulator. If you don't know why I needed to specify Intel, then you don't really understand the standard. I've also done proofs of the accuracy of floating point computations.
If you are doing currency computations, you are working exclusively with integers. But not integers of the currency base, integers of the smallest denomination. (For the US dollar, this is a mill, 0.1 cents.)
If you are taking a reading from an instrument, you are getting a quantized result. Those translate to integers.
The problem is that we like to write contracts that require us to divide. See the "average daily balance" for credit cards, or the financial compacts that lead up to the Euro. But even then, when the computation is done, the results are integers.
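A minimal Python sketch of that approach, using an average-daily-balance calculation (the balances and the half-up rounding rule are made up for illustration):

# Keep money as integers of the smallest denomination (mills, i.e. 0.1 cents)
# and confine the one unavoidable division to a single, explicitly rounded step.
MILLS_PER_DOLLAR = 1000

daily_balances_mills = [123_450, 123_450, 98_760, 98_765, 98_760, 98_760]
n = len(daily_balances_mills)
total = sum(daily_balances_mills)

# Round half up back to whole mills; the actual rounding rule is whatever the
# contract specifies, half-up is only an assumption for this sketch.
avg_mills = (total + n // 2) // n

print(f"average daily balance: {avg_mills} mills (${avg_mills / MILLS_PER_DOLLAR:.3f})")

Everything stays an integer except the one division the contract forces on you, and the result of that is rounded straight back to an integer under an agreed rule.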
Floating point is an abstraction with some really nasty leaks. If you are seriously concerned about last-bit accuracy in your computations, you are going to have to jump through some crazy hoops.
The fact that Excel, MATLAB and other PHB-level applications don't worry about it means that you better not either if you use these products.