2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists. The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nv's Volta technology. …

  1. Anonymous Coward
    Anonymous Coward

    no mention of bitcoin mining

    Wonder how that would get on.

    1. 9Rune5
      Coat

      Re: no mention of bitcoin mining

      My guess is that we will have to reinstate the bite-test and carefully inspect each and every bitcoin we handle.

      Mine is the one with dentures in the pocket.

      1. CrazyOldCatMan Silver badge

        Re: no mention of bitcoin mining

        reinstate the bite-test

        Which, with gold coins, was mostly to test whether they were adulterated with lead. Some unscrupulous moneylenders or mints would use it to make the expensive stuff go further, thus creating more profit.

        Just as well our financial institutions today are not that unethical eh?

        1. Frenchie Lad

          Re: no mention of bitcoin mining

          Ask the Germans why it's impossible for them to get their own gold back from the USofA. It was stored there during the Cold War but its return is impossible owing to "security concerns". The Yanks are adamant that they haven't lent/sold it to anyone else.

    2. PNGuinn
      Trollface

      Re: no mention of bitcoin mining

      Fine if you spend 'em feeding your gambling habit?

      1. BebopWeBop
        Happy

        Re: no mention of bitcoin mining

        I take it that the winnings are not paid in Bitcoin?

    3. MonkeyCee

      Re: no mention of bitcoin mining

      Not masses of data on it in the wild, but about 50% increase over a 1080ti would be my guess.

      Maybe 3-4 bucks a day income, 2-3 in profit.

      Not worth it for mining.

    4. HamsterNet

      Re: no mention of bitcoin mining

      For mining (not bitcoin, as that's ASIC only now) but any alt coins:

      The Titan gets around 77MH on ETH at a cost of £3k and drawing over 230W.

      A Vega 56 or 64 get 48MH on ETH at a cost of £500 (but really £600-900 now) but only draw 100W (you do need to do some serious optimisation on both to get these figures)

      If it's throwing memory errors then it will also cock up the mining. Memory errors result in incorrect shares which are not paid for at all.

      Very surprised at how low the HBM memory bandwidth is. 8GB of HBM2 on a Vega will overclock to over 600GB/s. This Titan has 3x the bus but only 652GB/s, which suggests they are not using the vastly superior Samsung chips but are using the Hynix pants.
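
The efficiency comparison in the comment above can be worked through in a few lines. This is only a sketch using the commenter's own figures: "MH" is assumed to mean MH/s, "over 230W" is taken as 230 W, and the £500 list price is used for the Vega rather than the inflated street price.

```python
# Figures quoted in the comment above; they are the commenter's estimates,
# not independent measurements.
cards = {
    "Titan V": {"mh_s": 77.0, "watts": 230.0, "price_gbp": 3000.0},
    "Vega 56/64": {"mh_s": 48.0, "watts": 100.0, "price_gbp": 500.0},
}


def efficiency(card):
    """Return (MH/s per watt, MH/s per pound) for one card."""
    return card["mh_s"] / card["watts"], card["mh_s"] / card["price_gbp"]


for name, card in cards.items():
    per_watt, per_pound = efficiency(card)
    print(f"{name}: {per_watt:.3f} MH/s per W, {per_pound:.4f} MH/s per GBP")
```

On these numbers the Vega comes out ahead on both hash rate per watt and hash rate per pound, which is the commenter's point.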

    5. Anonymous Coward
      Anonymous Coward

      Re: no mention of bitcoin mining

      No mention because no self respecting miner would use a titan to mine any cryptocurrencies. (let's ignore the fact bitcoin hasn't been minable on GPU's for about 5 years)

      We don't just need speed we need efficiency. Titans are not efficient in price nor power usage.

  2. Anonymous Coward
    Anonymous Coward

    I guess we shouldn't be surprised. Certainly, I don't know if this is an architectural issue but speed at all cost is what is being pushed out and it's what sells, although if there was a real cost to manufacturers maybe they'd reel it in a little. If it's good enough for gaming use but not scientific use it should be labelled as such. It is aimed at industry, so needs to perform better than it does.

    from the website:

    https://www.nvidia.com/en-us/titan/titan-v/

    NVIDIA TITAN V is the most powerful graphics card ever created for the PC, driven by the world’s most advanced architecture—NVIDIA Volta. NVIDIA’s supercomputing GPU architecture is now here for your PC, and fueling breakthroughs in every industry.

    1. Anonymous Coward
      Anonymous Coward

      Re: I guess we shouldn't be surprised

      I'm not. I recall seeing a seminar or conference presentation shortly after people began touting GPU's for scientific computation, where the presenter pointed out that they were optimized for graphics output at speed ... a domain where getting a bit or two wrong every now and then would only show up as a probably small and very brief visual glitch.

      Still, perhaps in the intervening decade or whatever, this became less likely, and so now the problem is re-emerging?

      1. ratfox

        Re: I guess we shouldn't be surprised

        Indeed, it used to be that GPU were completely unreliable for precise computations. Of course, that has changed in the past decades, when the industry realized that there was money in fast GPUs that did not make mistakes, and advertised them as such.

        There's nothing wrong in itself with GPU that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.

        1. Richard 12 Silver badge

          This one is *supposed* to be used for compute

          nVidia market this one as being for compute, and actively discourage using their other - much cheaper - cards for this kind of thing.

          To put it bluntly, this is a massive blunder on the part of nVidia that's going to damage their reputation for a decade or more.

        2. Anonymous Coward
          Anonymous Coward

          Re: I guess we shouldn't be surprised

          There's nothing wrong in itself with GPU that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.

          This. Casual users might not care. But for visual artists, this will act like a bug and bug the hell out of them. It'll be even worse when a day's worth of rendering has this one single visual imprecision that forces the artists to constantly redraw their art assets.

          1. anothercynic Silver badge

            Re: I guess we shouldn't be surprised

            It's not just artists. It's scientists who rely on computations to be exact. Not 4.1. Not 4.3. 4.0. Nothing else.

            1. Paul Shirley

              Re: I guess we shouldn't be surprised

              Using a computer and finite-resolution math, scientists expect results to be repeatable with provable error bounds, NOT exact. Nvidia returning random tainted values breaks both expectations and gets results wrong in all other senses!
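
A rough illustration of "repeatable with provable error bounds", written as a CPU-side Python sketch rather than anything from the researchers' actual test. The data, the seed and the helper name `naive_sum` are made up for the example, and the bound used is a deliberately loose form of the textbook bound for left-to-right summation.

```python
import math
import random
import sys

random.seed(42)  # fixed seed so the input data is the same on every run
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]


def naive_sum(values):
    """Left-to-right summation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Repeatability: same input, same operation order, bit-identical output.
first = naive_sum(xs)
second = naive_sum(xs)

# Error bound: |computed - exact| <= n * eps * sum(|x_i|).
reference = math.fsum(xs)  # correctly rounded sum, standing in for the exact value
bound = len(xs) * sys.float_info.epsilon * sum(abs(v) for v in xs)

print(first == second)
print(abs(first - reference) <= bound)
```

Healthy hardware should pass both checks every time. A card that fails the first one on some runs is not being "slightly imprecise"; it is broken.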

    2. Anonymous Coward
      Anonymous Coward

      This isn't news.

      People have been trying to shoehorn NVIDIA gaming GPUs into scientific applications ever since the Tesla GPUs were first introduced. NVIDIA's marketing has always been ambiguous regarding the differences between desktop and enterprise kit, and they leave it up to the systems integrators to tell customers whether or not the GPU would be a good fit for their applications. They love to tout the "supercomputing" architecture of the gaming GPUs as a way to move more units, but they're not designed to replace their Tesla line. They're produced from the same binning practices that distinguish desktop CPUs from their enterprise counterparts; a Tesla GPU that doesn't meet the performance standards to be enterprise-worthy gets remade into a desktop GPU with a bunch of features disabled.

      The fact is that gaming GPUs can be used in various scientific applications, but they've never promised accurate results, because they don't have double precision. If you don't know why you would need double precision, then you probably shouldn't be in the market for a supercomputer.

  3. John H Woods

    3 <= 2 + 2 <= 5

    <Pedant>

    2+2 can, of course, be anywhere in the range 3..5 when rounding to 1 significant figure.

    </Pedant>

    1. Chris Miller

      Re: 3 <= 2 + 2 <= 5

      2 + 2 = 5

      (for sufficiently large values of '2')

      1. Anonymous Coward
        Anonymous Coward

        Re: 3 <= 2 + 2 <= 5

        2 + 2 = 22

        Not sure where you two went to skool.

        1. John G Imrie
          Happy

          Re: 3 <= 2 + 2 <= 5

          2+2 = 10

          for sufficiently small values of base.

    2. d3rrial

      Re: 3 <= 2 + 2 <= 5

      3 <= 2 + 2 <= 5 = 1

      easy.

      (((3 <= 2) + 2) <= 5)

      3 <= 2 = 0

      0 + 2 = 2

      2 <= 5 = 1
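
For the pedants among the pedants: in C, `+` binds tighter than `<=`, so the expression actually groups as `(3 <= (2 + 2)) <= 5` rather than the grouping posted above, although the answer is 1 either way. A small Python sketch, where the helper `le` is a made-up stand-in for C's integer-valued comparison:

```python
def le(a, b):
    """Emulate C's <=, which yields the int 0 or 1."""
    return int(a <= b)


posted = le(le(3, 2) + 2, 5)   # grouping used in the comment: ((3 <= 2) + 2) <= 5
c_rules = le(le(3, 2 + 2), 5)  # C precedence: (3 <= (2 + 2)) <= 5
chained = 3 <= 2 + 2 <= 5      # Python chains comparisons: 3 <= 4 and 4 <= 5

print(posted, c_rules, chained)  # 1 1 True
```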

      1. Anonymous Coward
        Anonymous Coward

        Re: 3 <= 2 + 2 <= 5

        You are all clearly smarter than me with my calculation however did you consider trumpets?

        1. d3rrial

          Re: 3 <= 2 + 2 <= 5

          No, trumpets are -p=<

          1. TRT

            Re: 3 <= 2 + 2 <= 5

            2 <= many <= many + 1

    3. Anonymous Coward
      Anonymous Coward

      Re: 3 <= 2 + 2 <= 5

      Ah, now I see why none of the climate change models actually fit reality.

    4. Zolko Silver badge

      Re: 3 <= 2 + 2 <= 5

      when rounding to 1 significant figure.

      actually, I think this is exactly what's happening. People here use Nvidia cards to do scientific computing, and recently they have reported that the same calculations bring different results. They attribute this to the massively parallel calculations that spread the sub-calculations on the many cores differently each time, such that intermediate results are calculated along different paths, and since these are floating-point calculations the roundings in these intermediate results differ.

      They say that, unlike in a real processor, there is no OS so there is no way to tell the processor what to do, or when, or how: it does its magic on its own.
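
The effect described above, where the same numbers are split across cores differently and the intermediate roundings differ, can be mimicked on a CPU. This is a toy sketch, not a model of how any GPU schedules work: `chunked_sum` and the strided split are invented for illustration, and the four input values are hand-picked so the rounding is easy to follow.

```python
def naive_sum(values):
    """Left-to-right summation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_workers):
    """Give each worker a strided slice, sum each slice, then add the partial sums."""
    partials = [naive_sum(values[i::n_workers]) for i in range(n_workers)]
    return naive_sum(partials)


# The exact sum is 2.0, but 1e16 + 1.0 rounds back to 1e16 in double precision.
values = [1e16, 1.0, -1e16, 1.0]

results = {n: chunked_sum(values, n) for n in (1, 2, 4)}
print(results)  # {1: 1.0, 2: 2.0, 4: 1.0}
```

Note that each split is still perfectly repeatable on its own. Order-dependent rounding only explains run-to-run differences if the order itself changes between runs.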

  4. Anonymous Coward
    Joke

    Looks like...

    There's some quantum-computing going on in those cards !

  5. IgorS

    Anyone know if this affects the server-class V100s, too?

  6. Claptrap314 Silver badge

    Redlining memory? Buhahahaha! Not a chance.

    I spent 10 years doing microprocessor validation, from 1996-2006.

    1) There is an approximately 0% chance that this is due to pushing memory to the edge of the envelope. All doped silicon degrades with use. If they push the envelope, then all of their cards will die sooner rather than later. The closest you get to this is what we call "infant mortality", where a certain percentage of product varies from the baseline to the point that it dies quickly.

    2) In order to root cause errors of this sort, it is really, really important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it.

    3) Design bug or manufacturing defect, inconsistent results come down to timing issues in every case I saw or heard about. In the worst case, you get some weird data/clock line coupling that causes a bit to arrive late at the latch. Much more often there is some corner case that the design team missed. Again, I would need to know the nature of the computations involved, and the differences observed, to meaningfully guess at the problem.

    1. 404
      Boffin

      Re: Redlining memory? Buhahahaha! Not a chance.

      What you term 'infant mortality', we out in the field call 'The Shitty One(s)' - Take 100 identical PC's, 93 of them run per spec, 5-6 run fast as fuck, and the final 1-2 total pieces of shite.

      Good times ;)

      1. Flakk
        Pint

        we out in the field call...

        Best laugh I've had all day. For you. Thanks for the field work you do.

      2. Vinyl-Junkie
        Thumb Up

        Re: we out in the field

        Indeed; build 100 identical PCs from the same image, 99 will work fine and the last one will be a never-ending source of problems.

    2. Cuddles

      Re: Redlining memory? Buhahahaha! Not a chance.

      "important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it."

      The article said that one person tested 4 GPUs and found problems with 2 of them. Given the small sample size, and only being tested on a single problem, I don't think there's really enough information to figure out which it might be.

    3. Anonymous Coward
      Anonymous Coward

      Re: Redlining memory? Buhahahaha! Not a chance.

      I'll tell you about the card behaviour...

      At the moment mine is mining Ethereum. I need to offset the cost until the newest generation of CUDA gets proper support and drivers get matured.

      I used to run the card overclocked with 120% Power and +142 Memory Overclock (best stable overclock at that time). It worked for a bit more than a month with no issues. I talked to other people online who couldn't get theirs to run as stable as mine did. I was quite happy that I won the silicon lottery until it stopped working at these settings.

      After a month and a bit, the card is stable at 100% power and +130 Memory overclock. If I try to overclock it a bit higher it stops working properly after a few hours. The ethminer displays the calculations but it doesn't go through with sending the results back to the pool.

      It seems to me that it has something to do with the memory as well.

      The card has degraded in performance with time and as far as ethereum mining goes it is related to the memory. Could you please explain how is this possible?

      1. Patched Out

        Re: Redlining memory? Buhahahaha! Not a chance.

        This is possible because semiconductor structures on silicon can wear out due to metal migration based on current densities and temperature. This causes their switching characteristics to change or degrade over time. Device Mean-Time-To-Failure (MTTF) is typically characterized using the Arrhenius equation, where higher device temperature results in shorter life.

        In a CMOS transistor structure, most power is dissipated when switching logic states. Power translates into heat. As operating frequency is increased, the transistors switch more often in less time causing more heat, which will accelerate wearout.

        It used to be that the characteristic life of a device could be 100 years or more, but with operating frequencies now in the GHz levels and device feature sizes shrunk to pack more transistors into less real estate, the design margins have shrunk to the point where characteristic lifetimes are reduced to a decade or two and greatly shortened by overclocking.

        Yes, I am a reliability engineer.
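
To put a rough number on the Arrhenius point, here is a minimal sketch of the standard acceleration-factor form, AF = exp((Ea / k) * (1/T_use - 1/T_stress)) with temperatures in kelvin. The activation energy of 0.7 eV and the two temperatures are illustrative assumptions only; nothing here is a measured figure for the Titan V or any other part.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(t_use_c, t_stress_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between two device temperatures in Celsius.

    activation_energy_ev=0.7 is an assumed, illustrative value.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


# How much faster wear-out proceeds at 85 C than at 65 C, under the assumed Ea.
print(round(acceleration_factor(65.0, 85.0), 2))
```

Under those assumptions a 20 C rise shortens the expected life several times over, which is consistent with an overclock that was stable for a month and then was not.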

        1. Paul Shirley

          Re: Redlining memory? Buhahahaha! Not a chance.

          Or the cooling has degraded over a couple of months and the overclock headroom shrank with it.

      2. Anonymous Coward
        Anonymous Coward

        Re: Redlining memory? Buhahahaha! Not a chance.

        Note also that as well as the other answer about hardware degradation, the Ethereum DAG size has grown tremendously, and people are seeing lower hashing rates than a few months ago.

      3. Tom 64

        Re: Redlining memory? Buhahahaha! Not a chance.

        > "Could you please explain how is this possible?"

        Nvidia are known for cutting corners to maximise profits.

        Google for Charlie Demerjian vs Nvidia

  7. Yet Another Anonymous coward Silver badge

    Probably fine for scientific computing

    In the real world there aren't exact answers and very few questions involving integers.

    If the errors are small and/or rare enough - that's fine.

    Our experiments give bad results all the time - we understand that, it's the whole point of experimental physics.

    1. Anonymous Coward
      Anonymous Coward

      Re: Probably fine for scientific computing

      It completely depends on the use case.

      Galaxy dynamics and Knights of the Old Republic - fine.

      Rocket dynamics - not fine.

      From Floating-Point Arithmetic Besieged by “Business Decisions” - A Keynote Address, prepared for the IEEE-Sponsored ARITH 17 Symposium on Computer Arithmetic, delivered on Mon. 27 June 2005 in Hyannis, Massachusetts

      You have succeeded too well in building Binary Floating-Point Hardware.

      Floating-point computation is now so ubiquitous, so fast, and so cheap that almost none of it is worth debugging if it is wrong, if anybody notices.

      By far the overwhelming majority of floating-point computations occur in entertainment and games.

      IBM’s Cell Architecture: “The floating-point operation is presently geared for throughput of media and 3D objects. That means ... that IEEE correctness is sacrificed for speed and simplicity. ... A small display glitch in one display frame is tolerable; ...” Kevin Krewell’s Microprocessor Report for Feb. 14 2005.

      A larger glitch might turn into a feature propagated through a Blog thus: “There is no need to find and sacrifice a virgin to the Gorgon who guards the gate to level 17. She will go catatonic if offered exactly $13.785.”

      How often does a harmful loss of accuracy to roundoff go undiagnosed?

      Nobody knows. Nobody keeps score.

      And when numerical anomalies are noticed they are routinely misdiagnosed.

      Re EXCEL, see David Einstein’s column on p. E2 of the San Francisco Chronicle for 16 and 30 May 2005.

      Consider MATLAB, used daily by hundreds of thousands. How often do any of them notice roundoff-induced anomalies? Not often. Bugs can persist for Decades.

      e.g., log2(...) has lost as many as 48 of its 53 sig. bits at some arguments since 1994.

      PC MATLAB’s acos(...) and acosh(...) lost about half their 53 sig. bits at some arguments for several years.

      MATLAB’s subspace(X, Y) still loses half its sig bits at some arguments, as it has been doing since 1988.

      Nevertheless, as such systems go, MATLAB is among the best.

      1. Yet Another Anonymous coward Silver badge

        Re: Probably fine for scientific computing

        Rocket dynamics - not fine.

        Depends on the trade off.

        Would you like to use a mainframe with duplicate processors checking each result - but only enough power to model your combustion chamber with 1m^3 cells

        Or a bunch of cheap GPUs where you can do 1mm^3 but some small percentage of those cells have incorrect values?

        Classic report I once got back from our supercomputer center. "The cray detected an uncorrected memory fault during your run. Please check your results" . If I could check the results manually why would I need the fsckign Cray?

      2. Claptrap314 Silver badge

        Re: Probably fine for scientific computing

        Among other products, I worked on the STI Cell microprocessor. It was my job to compare the documents to the actual product. The documents plainly stated the accuracy of the division-class instructions. If MATLAB or whoever failed to write code that incorporated the information in the documentation, whose fault is that? That MATLAB had been so egregiously wrong for a decade before the STI Cell microprocessor came out should help if anyone is confused.

        IEEE-754 is fine for specifying data representation and last-bit correctness. But the nasty corners are an excellent example of just how bad committee work can be. And the folks writing floating point libraries regularly produce code that is mediocre at best. Don't blame hardware for your software bugs.

  8. Stuart Dole

    Shades of the Pentium floating point bug?

    Not the first time this sort of thing has cropped up! Old-timers will remember the famous “Pentium floating point bug”.

    1. Tony Gathercole ...
      FAIL

      Re: Shades of the Pentium floating point bug?

      Or even the similar floating-point problem with the DEC VAX (8600/Venus family I think) circa 1988? Recollections of my then employer's chemical engineering department having to re-run significant amounts of safety-critical design related computations over a period of weeks and months on alternative (slower) VAX models. In the support teams we ended up scheduling low-priority batch jobs running tasks with known results, set up to flag recurrences of the problem as it wasn't failing consistently.
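
The "batch jobs with known results" trick described above is still a sensible way to catch intermittent faults. A minimal sketch of the idea follows; the workload, the function names and the number of runs are all invented for illustration, and a real canary would run on the suspect device rather than in plain Python on the CPU.

```python
import hashlib
import struct


def reference_workload(n=100_000):
    """Deterministic floating-point workload: a fixed-order sum of 1/sqrt(i)."""
    total = 0.0
    for i in range(1, n + 1):
        total += (i ** 0.5) / i
    return total


def fingerprint(value):
    """Hash the exact bit pattern of a double so any single bit flip is detected."""
    return hashlib.sha256(struct.pack("<d", value)).hexdigest()


def run_canary(expected, runs=5):
    """Return the indices of runs whose result differs from the known-good fingerprint."""
    return [i for i in range(runs) if fingerprint(reference_workload()) != expected]


known_good = fingerprint(reference_workload())  # captured once on trusted hardware
print(run_canary(known_good))  # an empty list means every run matched
```

Comparing bit patterns rather than rounded values matters here: the whole point is to notice a result that is wrong only in its last few bits, and only some of the time.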

    2. Neil Barnes Silver badge

      Re: Shades of the Pentium floating point bug?

      Yabbut - at least the Pentium got the same wrong answer every time.

    3. bazza Silver badge

      Re: Shades of the Pentium floating point bug?

      Some versions of PowerPC’s AltiVec SIMD unit are optimised to complete instructions in a single clock cycle at the expense of perfect numerical accuracy. Well documented, understood and repeatable, this was fine for games, signal processing, image processing.

      This problem with NVidia’s latest sounds different. Sounds like a big mistake in the silicon process, or too optimistic on the clock speed.

    4. Pascal

      Re: Shades of the Pentium floating point bug?

      I am Pentium of Borg. Division is futile. You will be approximated.

      1. TRT

        Re: Shades of the Pentium floating point bug?

        The ultimate answer? To life, the universe and everything? OK, the answer... the answer is...

        41.999999999999999

        I said you weren't going to like it.

        1. DropBear
          Trollface

          Re: Shades of the Pentium floating point bug?

          Well that depends on whether you meant literally 41.999999999999999, or 41.(9), considering the latter is mathematically exactly equal to 42 (yes, really)...

          1. Anonymous Coward
            Anonymous Coward

            Re: Shades of the Pentium floating point bug?

            It is equivalent - subtle difference

        2. John H Woods

          Re: Shades of the Pentium floating point bug?

          "the answer is... 41.999999999999999" - TRT

          Exactly? Or 41 point nine recurring? The latter is exactly equal to 42

          1. TRT

            Re: Shades of the Pentium floating point bug?

            As a double precision float, obviously.
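
As it happens, the double-precision answer is kinder than it looks. Doubles just below 42 are spaced about 7.1e-15 apart, so a literal with fifteen 9s after the point (which is what appears to have been posted) is closer to 42 than to the next double down, and it is stored as exactly 42. A quick check, building the literal from a string so the digit count is explicit:

```python
from fractions import Fraction

literal = float("41." + "9" * 15)  # 41.999999999999999, fifteen 9s

print(literal == 42.0)          # True
print(Fraction(literal) == 42)  # True: the stored value is exactly 42
```

So both camps get their way: 41.(9) recurring equals 42 mathematically, and the finite decimal rounds to 42 on the way into a double.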

            1. Wilseus
              Trollface

              Re: Shades of the Pentium floating point bug?

              Anyone remember the old joke?

              Q. What do you call a series of Pentium FDIV instructions?

              A. Successive approximations.

      2. Updraft102

        Re: Shades of the Pentium floating point bug?

        Meh, you beat me... but the Borg never said "I."

        1. Vinyl-Junkie
          Alien

          Re: but the Borg never said "I."

          "I am Locutus of Borg"

          I rest my case....

    5. Adam 1

      Re: Shades of the Pentium floating point bug?

      I think this is all part of the Intel cross licensing arrangements.

    6. Updraft102

      Re: Shades of the Pentium floating point bug?

      We are Pentium of Borg.

      Division is futile. You will be approximated.

    7. sitta_europea Silver badge

      Re: Shades of the Pentium floating point bug?

      Yeah, and I never did get repeatable answers out of mprime on my AMD Opterons.

  9. Steve Aubrey
    Unhappy

    Space opera

    Glitches on Titan V

    Wish Asimov (or Heinlein, or Pournelle) were here to write it.

    1. leenex

      Re: Space opera

      Kurt Vonnegut, actually.

    2. Robert Carnegie Silver badge

      Re: Space opera

      "Glitches on Titan V

      Wish Asimov (or Heinlein, or Pournelle) were here to write it."

      How close is "Missed cache on Ganymede"?

      https://en.wikipedia.org/wiki/Christmas_on_Ganymede : aboriginal workers on a Jovian satellite named after a pretty boy employed as Zeus's "cup-bearer" (!) demand a visit from Santa Claus on a flying sled bringing presents. Corporate colonial Earthmen manage, with great effort, to provide this performance and avoid or conclude strike action. Earthmen then realise that an annual visit from Santa Claus means once every orbit of Ganymede around Jupiter, or once a week.

      Quite a miscalculation, there.

      I would say that in reality the "Ossie" strikers (look like ostriches) would (not should, but would) be taken up on the sled and dropped from considerable height, but I'm not sure that that would matter. They might even fly down.

      Wikipedia reports another miscalculation of sorts: Asimov, aged about 20.9, wrote and offered the short story in December 1940, then became aware that a Christmas story needed to be sold by July to appear by Christmas. In fact he sold it the following June. Then of course in the 1990s Robert Silverberg expanded it to novel length. :-) Not true, but probably he still has time this year, it is only March......

  10. AZump

    Maybe the reason for the error is a bit more sinister.

    I'm thinking this "bug" is intentional. Since it only applies to scientific calculations, and exists on the "gaming GPU", I'm thinking this is Nvidia's way of getting people to stop using the cheaper GPUs to build HPC's.

    Didn't they ask nicely for folks to stop using the gaming GPUs for HPC, then change their T&Cs to prevent such action? I'm almost 100% positive that they did, which makes this rather suspect in my eyes.

    1. MonkeyCee

      Re: Maybe the reason for the error is a bit more sinister.

      "Didn't they ask nicely for folks to stop using the gaming GPUs for HPC, then change their T&Cs to prevent such action"

      I'm not sure nicely went anywhere near it, it's a straight up money grab. It would be like if you could buy a PC for gaming for current prices, but if it's used in any way to make money, then it costs 15 times as much, plus an annual licence fee equal to 5 times the cost.

      The 16-bit cards are licenced annually IIRC, as well as costing about as much as a small car. If 8-bit accuracy will do you, then you can get 3-4 cards and the computer to go with them for about the same as the dev licence fee to go with that 16-bit card.

      The software licencing change was to say you can't use CUDA libraries in a "datacentre" without paying their annual dev licencing fee.

      At no point have they said, or should be able to (barring some serious revision of end user rights), you can't use a particular GPU for a particular purpose. You just can't use their optimized libraries.

      I'm not sure it's *quite* worked as intended, as in academic and research circles it just seems to have validated the "use open source" argument. There had been some debate on whether you use OpenGL or CUDA, with the result that no ATI cards were being bought. But now you either find another 20-30 grand in the budget, or you use OpenGL.

      1. Anonymous Coward
        Anonymous Coward

        Re: Maybe the reason for the error is a bit more sinister.

        And lower down the scale, a fair few developers of productivity applications started supporting OpenCL when the 'Trash Can' Mac Pro was released with AMD cards.

    2. ibmalone

      Re: Maybe the reason for the error is a bit more sinister.

      I'm actually a bit confused looking at the description, they talk about supporting deep learning (the Tensor cores), but pitch it as a graphics card, where their stance is these are not for compute use.

    3. leenex

      Re: Maybe the reason for the error is a bit more sinister.

      Bug fix in 2018:

      Gaming card with bug UltimateGamingExperience € 2000.-

      Gaming card without bug TooGoodForGaming® € 4000.-

      Our new card contains the TooGoodForGaming® technology

      Everybody wants it.

  11. SVV

    Are they using the wrong datatype?

    Using float instead of decimal as a datatype for things such as financial calculations in C type languages is a common mistake that results in these sorts of errors (floating point maths is not 100% decimally accurate). If I had a pound for every time I'd seen it occur in a business system I'd be a good £7.01 better off.
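
For anyone who has not been bitten by this yet, a short sketch of the float-versus-decimal problem. It uses Python's standard `decimal` module purely as an illustration; the comment above is about C-type languages, where the equivalent would be a decimal or fixed-point library.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)                                      # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True

# Add one penny to a running total 100 times with each type.
float_total = 0.0
decimal_total = Decimal("0.00")
for _ in range(100):
    float_total += 0.01
    decimal_total += Decimal("0.01")

print(repr(float_total), decimal_total)
```

Note that this kind of error is fully repeatable: the float total is off by the same amount on every run, which is the point the replies go on to make.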

    1. Horridbloke

      Re: Are they using the wrong datatype?

      If it was just inappropriate use of floating point maths then results would still be repeatable.

      1. Spacedman

        Re: Are they using the wrong datatype?

        Not necessarily. Floating point arithmetic is not commutative - (a+b)+c =/= a+(b+c) [https://en.wikipedia.org/wiki/Associative_property]. If you are running on a parallel system you can't be certain which order operations will be done in - it will depend on how the system has scheduled your calculation out to the processing units. I've had this happen when running statistical models using OpenMP on a single CPU, on a GPU it wouldn't surprise me at all.

        1. Anonymous Coward
          Anonymous Coward

          Re: Are they using the wrong datatype?

          This is fixed precision so A+B == B+A. I suspect this is precisely why they know it is an issue with the titan-v itself giving the wrong answer occasionally.

          https://www.sciencedirect.com/science/article/pii/S0010465512003098?via%3Dihub

        2. kouja

          Re: Are they using the wrong datatype?

          Agree, just

          s/commutative/associative/

          floating point is commutative.

          Anyway I assume the scientific algorithm is designed to be stable accounting for parallel execution. (so associative errors should not accumulate this way).

          => something more than just an algorithm to blame
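
The commutative-versus-associative distinction in this sub-thread is easy to check for yourself. A minimal Python sketch using ordinary IEEE 754 doubles:

```python
a, b, c = 0.1, 0.2, 0.3

# Commutativity holds: swapping the operands of one addition changes nothing.
print(a + b == b + a)  # True

# Associativity does not: regrouping changes where the rounding happens.
print((a + b) + c)     # 0.6000000000000001
print(a + (b + c))     # 0.6
```

Any scheme that regroups additions, such as splitting a sum across threads, can therefore change the last bits of the result even though every individual addition is correctly rounded.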

    2. Nick Ryan Silver badge

      Re: Are they using the wrong datatype?

      Aside from Horridbloke's point about repeatability, probably not. There are many use cases for floating point datatypes as long as the accuracy level is understood and operated within. For example a certain floating point type may be accurate to 15 decimal places and therefore calculations should be accurate to the same. Although the reality is that if you want accuracy to 15 decimal places then you need to work with accuracy at least an order or two beyond this. The same is true for financial calculations: if you want accuracy to two decimal places then either store and process everything with accuracy to three or four decimal places or perform repeatable rounding in sub-totals and use these values for grand totals, not separate calculations.

  12. far2much4me

    I Don't See The Problem

    "Take for instance software that models molecular interactions."

    At the quantum level, there is uncertainty as to position, or even the outcome. It seems these cards are modelling that behavior.

    1. Chemist

      Re: I Don't See The Problem

      "At the quantum level, there is uncertainty as to position, or even the outcome. It seems these cards are modelling that behavior."

      Point is they're not supposed to be doing that. Even if the algorithms generate deterministic chaotic outputs that wouldn't explain 2 cards being OK

    2. Anonymous Coward
      Anonymous Coward

      Re: I Don't See The Problem

      There is uncertainty in price, too.

    3. arctic_haze

      Re: I Don't See The Problem

      Seriously, this is not how it's done. You are actually trying to calculate the Schrodinger wave function, not the measurements values. In fact you are calculating probabilities, not actual values. So the only source of uncertainty in the calculations should be the approximation in the calculation methods (for example calculating partial sums where the real values is an infinite sum).

  13. soulg

    "Engineers speaking to The Register on condition of anonymity to avoid repercussions from Nvidia said the best solution to these problems is to avoid using Titan V altogether until a software patch has been released to address the mathematical oddities." Wow. This was the most interesting part of the article. What sort of "repercussions" do the engineers expect Nvidia to throw at them? What has Nvidia done in the past to engineers who reveal these types of flaws?

    1. Sir Runcible Spoon

      Sacked them I expect :p

    2. Steve K

      Repercussions...

      They tore them into 63 bits

      1. ArrZarr Silver badge

        Re: Repercussions...

        Graphics cards are now so fast that they require a human sacrifice to make them work. Speaking negatively of Nvidia just puts you further up the queue.

        1. Shady

          Re: Repercussions...

          Ah, so that's what preemptive execution means then?

  14. TimeMaster T
    Thumb Up

    This might be the best thing since sliced bread

    GPUs good enough for gaming but not so great at bitcoin and other non-gaming applications will bring the price on high end gaming graphics cards back down to something more reasonable.

    The bit miners can create a niche market for dedicated "BmPU" (Bitmining Processor Units) that I'm sure someone will seek to corner.

    1. Dave 126 Silver badge

      Re: This might be the best thing since sliced bread

      Serious Bitcoin miners, and miners of many other crypto currencies, have switched to specialised hardware generically known as ASIC - silicon which can only do one thing (e.g. run the Bitcoin algorithm). The development cost of such dedicated silicon is high, so only bigger miners use it. Having only a few large players in the game runs contrary to the point of crypto currencies, i.e. the work on the ledger should be distributed amongst millions of users so that no one entity can influence over 50% of the work. Apparently having fewer big players over many small players leads to fluctuations, too.

      Ethereum - perhaps the largest CC after Bitcoin - uses an algorithm that is deliberately suited to GPUs over current ASIC designs because it requires a lot of memory. In addition, Ethereum keeps threatening to switch from Proof of Work to Proof of Stake in an effort to dissuade anyone making the investment in an Ethereum ASIC.

      Ironically, it was the idea that millions of people already possessed powerful CPUs and GPUs for non-mining purposes - thus combined they had more power than any Bad Actor might hope to acquire - that was central to Bitcoin.

      1. anonymous boring coward Silver badge

        Re: This might be the best thing since sliced bread

        What is this mysterious thing called an "ASIC"?

        https://en.wikipedia.org/wiki/Application-specific_integrated_circuit#History

  15. Anonymous Coward
    Anonymous Coward

    Looks at grant award.

    Looks at Tesla quote.

    Looks at possibility of getting more compute nodes instead...

  16. Maty

    As Einstein didn't say ...

    Insanity is doing the same thing over and over again on your computer and expecting the same results.

  17. Anonymous Coward
    Anonymous Coward

    Could cause serious harm

    Nvidia hardware is being used in automotive infotainment and navigation systems. It is also being used in prototype autonomous vehicle development. This type of math defect could cause serious injury if allowed to exist in any control systems.

  18. Snobol4

    First Nvidia quantum computing product

    Does Titan V therefore represent Nvidia's unexpected entry into the quantum computing market?

  19. EBG

    eh ?

    ...that models molecular interactions. This sort of code uses Newtonian equations ...

    If it does, that's their first mistake.

  20. Anonymous South African Coward Bronze badge

    Return of the F00F zombies

    there's safety in them thar hills...

  21. Dodgy Geezer Silver badge

    Reproducibility?

    ...All in all, it is bad news for boffins as reproducibility is essential to scientific research....

    Er. NO. Grants are essential to scientific research. To get grants you need to have published papers. To get papers published you need to have spectacular findings, supported by amazing data. And if your raw data is not amazing enough - just change it.

    I recommend this site: https://retractionwatch.com/

    1. Fading
      Holmes

      Re: Reproducibility?

      What you are referring to is grant farming. Not to be confused with real scientific research (easy to confuse the two as the white coats mix together like zebras on the Serengeti). Real scientific research (tm) is low key and frequently rubs the consensus up the wrong way and has a tough time in peer review whilst grant farming only ever works within the consensus paradigm and shoots through pal review with spelling mistakes intact.

  22. Bryan Hall

    Self-driving cars anyone?

    Isn't this the line of GPUs NVIDIA was pushing to power their NDRIVE SDC modules?

    Maybe that is what UBER was using? Ooops.

  23. Wilseus
    FAIL

    Conformance Tests

    Assuming this hardware supports APIs such as OpenCL, there are conformance tests which exist in order to detect just this kind of thing. I find it difficult to understand how these problems weren't discovered by Nvidia long before now.

    1. Anonymous Coward
      Anonymous Coward

      Re: Conformance Tests

      It's a case of the left hand not knowing what the right hand is doing at NVIDIA I think. History repeats itself many times.

      http://archive.ambermd.org/201308/0094.html

  24. Stevie

    Bah!

    So some sort of run-time optimizer is picking an intermediate data type that is unhelpful?

    That would be my guess.
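
    For what an unhelpful intermediate type can do, here is a hedged Python sketch. It is not a reproduction of the Titan V issue and says nothing about what Nvidia's drivers actually do; it only compares a running total kept in 32-bit floats with one kept in 64-bit floats over the same inputs. The helper names are invented for the example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, narrow: bool) -> float:
    """Add values in order; if narrow, round the running total to binary32 each step."""
    total = 0.0
    for v in values:
        total += v
        if narrow:
            total = to_float32(total)
    return total

values = [to_float32(0.1)] * 10_000   # identical inputs for both runs
reference = values[0] * 10_000        # a single multiplication in binary64

wide = accumulate(values, narrow=False)
narrow = accumulate(values, narrow=True)

print(f"binary64 accumulator error: {abs(wide - reference):.3e}")
print(f"binary32 accumulator error: {abs(narrow - reference):.3e}")
```

    Both runs are perfectly repeatable, though, so a narrow accumulator alone would give wrong-but-consistent answers rather than answers that change from run to run.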

  25. unwarranted triumphalism

    This is Apple's fault.

  26. mjflory

    Effect on machine learning?

    An error like the one described could wreak havoc with exact calculations, but I wonder if it would make much difference in the machine learning models for which these cards are so widely used. In the early days of neural networks, "graceful degradation" was said to show their similarity to our brains, where the loss of a neuron or two has a negligible effect. A systematic error might throw off calculations of connection weights, but random errors might well have a comparably minor effect.
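
    A toy Python illustration of that last distinction. It is not a neural network and proves nothing about real training runs; the error size `eps` and the count `n` are arbitrary assumptions. It just adds up many small errors, once with random signs and once all in the same direction.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000              # number of operations (arbitrary)
eps = 1e-3              # hypothetical per-operation error size

# Zero-mean random errors: independent draws from [-eps, +eps].
random_total = sum(rng.uniform(-eps, eps) for _ in range(n))

# Systematic error: every operation is off by +eps in the same direction.
systematic_total = n * eps

print(f"random errors accumulate to     {random_total:+.4f}")
print(f"systematic errors accumulate to {systematic_total:+.4f}")
```

    Random errors largely cancel, growing roughly with the square root of the number of operations, while a systematic bias grows in proportion to it.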

  27. Luiz Abdala

    Titan V for Pure Universal Bug Grep... research, then?

    Who's got a Titan V, about to toss it in the bin, willing to donate it for... research?

    You know, P. U. B. G, a linux memory dump readout effort searching for misplaced error strings...

  28. ridley

    Sometimes we need more than +

    ""All of our GPUs add correctly," the rep told us."

    No comment on /,* and - then?

    Is this a case of Nvidia being economical with the truth?

  29. Gustavo Fring
    Happy

    can still be used

    For generating the Lotto number on Wednesday and Friday tho ....

  30. straphlinger

    Get in Contact via phone.

    If you are affected by this bug, please contact us on 2240..., no wait 2241...., no was it 2243....

  31. Anonymous Coward
    Anonymous Coward

    Titan V Bugzilla

    Actually I expected something like this.

    The problem with effects like Rowhammer is that the actual physics comes back to bite you.

    Fix one problem and it just causes another one, such as in the rarely documented cases of ASLR breaking down with hash collisions where the data being processed matches up by chance with the ASLR algorithm used.

    I found that DDR4 and 5 do still get problems related to heat, which can cause all sorts of instability, and it is well known that for high reliability applications you want to run the chips at far less than maximum clock rate to give those trillions of capacitors a fair chance for refresh to work properly.

  32. anonymous boring coward Silver badge

    "All in all, it is bad news for boffins as reproducibility is essential to scientific research. "

    Being correct also ranks pretty high.

    Being wrong in the same way every time, not so much.

  33. Oldish Git

    8087 / 80287 anyone?

    Trixy things, they floatin' points.

    IIRC, the problem first manifested in early versions of Excel.

    By the way, what's a Pentium?

  34. Claptrap314 Silver badge

    On Data Types

    Over, and over again: floating point does not exist in the part of the real world that computers actually interact with. I've built an (Intel) IEEE-754 emulator. If you don't know why I needed to specify Intel, then you don't really understand the standard. I've also done proofs of the accuracy of floating point computations.

    If you are doing currency computations, you are working exclusively with integers. But not integers of the currency base, integers of the smallest denomination. (For the US dollar, this is a mill, 0.1 cents.)

    If you are taking a reading from an instrument, you are getting a quantized result. Those translate to integers.

    The problem is that we like to write contracts that require us to divide. See the "average daily balance" for credit cards, or the financial compacts that led up to the Euro. But even then, when the computation is done, the results are integers.
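
    A rough Python sketch of the integers-only approach, division included. The balances, the round-half-up rule and the function names are assumptions made for the example, not anything taken from a real card agreement.

```python
MILLS_PER_DOLLAR = 1000  # one mill is 0.1 cents

def average_daily_balance(daily_balances_mills: list[int]) -> int:
    """Average of non-negative integer balances, rounded half up to a whole mill."""
    days = len(daily_balances_mills)
    total = sum(daily_balances_mills)
    # Integer-only rounding: add half the divisor before floor division.
    return (2 * total + days) // (2 * days)

def format_dollars(mills: int) -> str:
    """Render a non-negative amount in mills as dollars with three decimals."""
    dollars, rest = divmod(mills, MILLS_PER_DOLLAR)
    return f"${dollars}.{rest:03d}"

balances = [100_000, 100_000, 250_005]  # hypothetical balances, in mills
avg = average_daily_balance(balances)
print(avg, format_dollars(avg))
```

    The division still forces a rounding decision, but it is made once, explicitly, and the result is again an integer.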

    Floating point is an abstraction with some really nasty leaks. If you are seriously concerned about last-bit accuracy in your computations, you are going to have to jump through some crazy hoops.
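
    One of the better-known leaks, as a short Python sketch: floating point addition is not associative, so the order of a sum can change the last bit. This is standard IEEE-754 behaviour on any conforming hardware and is shown only as an illustration, not as the cause of the Titan V reports.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # add left to right
right = a + (b + c)   # same numbers, different grouping

print(repr(left), repr(right), left == right)

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
print(repr(math.fsum([a, b, c])))
```

    One of the crazy hoops, then, is pinning down the order of every reduction, or using a compensated sum, if you want bit-identical results.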

    The fact that Excel, MATLAB and other PHB-level applications don't worry about it means that you better not either if you use these products.

  35. Claptrap314 Silver badge

    Bitcoin mining?

    I cannot imagine a serious hashing algorithm that used floating point. This is a floating point bug issue, right?
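
    For reference, Bitcoin's proof of work hashes the block header with SHA-256 applied twice, and SHA-256 is defined purely in terms of 32-bit integer and bitwise operations. A minimal Python sketch using the standard library, with a placeholder byte string standing in for a real header:

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, the construction Bitcoin uses for block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

header = b"hypothetical block header bytes"  # placeholder, not a real header

first = double_sha256(header)
second = double_sha256(header)

print(first.hex())
print("repeatable:", first == second)
```

    No floating point is involved anywhere in that path, so a floating point fault would not be expected to show up in it.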
