To heck with the laws of physics... we will squeeze more juice from these processors

Dratted laws of physics. Cranking up frequencies is difficult due to leakages from ever smaller guard rails on the electron highways inside the processor. You have to jack up the power to make sure the instructions make it through, which leads to thermal problems. Martin Hilgeman, HPC consultant with Dell EMC, gave a tour de …

  1. John Smith 19 Gold badge
    Unhappy

    "frequently encounter data sets with very sparse matrices."

    For which much more efficient methods exist than storing it in the obvious n-dimensional array.

    In fact in big apps with multiple matrices being processed together the first task is an optimization to find a better order in which to multiply them.

    This can make a very substantial difference to how much processing is done.
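As a rough illustration of the point about sparse storage, here is a minimal Python sketch of compressed sparse row (CSR) storage, one common alternative to the plain n-dimensional array. The function names `to_csr` and `csr_matvec` and the sample matrix are made up for this example and are not taken from the article or the thread. Only the non-zero entries are stored, so a matrix-vector product touches just those entries.

```python
def to_csr(dense):
    """Convert a dense row-major matrix into CSR form.

    Returns (values, col_idx, row_ptr): the non-zero values, the column of
    each value, and the offset in `values` where each row starts.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, vec):
    """Multiply a CSR matrix by a dense vector, visiting only non-zeros."""
    result = []
    for i in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * vec[col_idx[k]]
        result.append(total)
    return result


# Hypothetical 4x4 matrix with 4 non-zeros out of 16 entries.
dense = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
```

Here the product does 4 multiplications instead of 16. Note that `to_csr` itself has to scan every entry once, so the saving only pays off when the matrix is reused or was never dense to begin with.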

    1. Anonymous Coward

      Re: "frequently encounter data sets with very sparse matrices."

      Unless, of course, you don't KNOW your matrix has a lot of zeroes in it. Doesn't it take processing power again to figure out your matrix has a lot of zeroes in it in the first place?

      1. John Smith 19 Gold badge
        Unhappy

        "Doesn't it take processing power again..matrix has a lot of zeroes in it in the first place?"

        Depends on the class of problem you're working on.

        Some problems are inherently sparse, others would need to scan the arrays needed.

        Some perspective from "Algorithms" by R. Sedgewick (Ch 42. Dynamic Programming). Multiplying 6 matrices together required 274200 multiplications in one ordering and 6024 in another ordering. None had more than 4 rows or 3 columns.

        If you're doing serious matrix work, some prep work before you start multiplying them out, even down to the ordering of the matrices, makes for very big savings.
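The ordering point can be sketched with the classic dynamic-programming solution to the matrix-chain problem. This is a minimal Python sketch, not the book's code: the function names are invented here, and the dimension list `dims` is an assumed example of six small matrices (4x2, 2x3, 3x1, 1x2, 2x2, 2x3), chosen only so that none has more than 4 rows or 3 columns.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1], so len(dims) - 1 matrices.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):            # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            # Try every split point k: (i..k) times (k+1..j).
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]


def left_to_right_cost(dims):
    """Scalar multiplications when multiplying strictly left to right."""
    total = 0
    for k in range(1, len(dims) - 1):
        # The running product always has dims[0] rows.
        total += dims[0] * dims[k] * dims[k + 1]
    return total


# Assumed example dimensions, not taken from the thread.
dims = [4, 2, 3, 1, 2, 2, 3]
naive = left_to_right_cost(dims)
best = matrix_chain_cost(dims)
```

For these assumed dimensions the left-to-right order costs 84 scalar multiplications and the best order costs 36. The gap grows quickly as the matrices get larger.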

  2. boltar Silver badge

    With non managed, non VM languages and less bloated APIs...

    ... we could probably squeeze twice the performance out of the current generation. But some coders just can't cope without their hand-holding managed VM languages and/or scripting languages.

  3. John Smith 19 Gold badge
    Unhappy

    "we could probably squeeze twice the performance out of the current generation. "

    Depends.

    Classic big matrix mashing apps are still coded in FORTRAN which is normally fully compiled and quite good.

    Theoretically JIT compilers allow development in "managed" languages but can be coded much closer to the raw hardware if the option is engaged.

    The best advice is still code the app as simply as possible then profile it to find out what is really taking the time. History has taught me the answer to this question is normally "Not where you think it is." IOW any time spent on "tricky" coding up front is probably in the wrong place. As a bonus it may have handicapped the translator from spotting optimizations it would have applied to a straightforward version and therefore give you a slower program (as well as one that was harder to write and harder to understand if someone else has to debug or upgrade it).
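A minimal sketch of "write it simply, then profile", using Python's standard `cProfile` and `pstats` modules. The workload functions (`cheap_setup`, `slow_sum_of_squares`, `run_workload`) are hypothetical stand-ins, made up for this example.

```python
import cProfile
import io
import pstats


def cheap_setup(n):
    """Looks like work, but is rarely the bottleneck."""
    return list(range(n))


def slow_sum_of_squares(n):
    """Deliberately plain loop that dominates the run time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_workload():
    cheap_setup(1000)
    return slow_sum_of_squares(200_000)


def profile_report(func):
    """Run func under the profiler and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_report(run_workload)
```

Printing `report` lists the functions by cumulative time, which is the evidence to look at before rewriting anything.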

    1. boltar Silver badge

      Re: "we could probably squeeze twice the performance out of the current generation. "

      "Theoretically JIT compilers allow development in "managed" languages but can be coded much closer to the raw hardware if the option is engaged."

      True, but there's still overhead even with JIT. Eg: boundary checking, garbage collection.

      "IOW any time spent on "tricky" coding up front is probably in the wrong place"

      Sometimes, not always. Obviously if a program is I/O bound then no amount of fancy coding is going to significantly speed it up, but if it's CPU bound then you can work with the compiler and profiler to tightly optimise the relevant part of the code.

      1. John Smith 19 Gold badge
        Unhappy

        @Boltar

        "True, but there's still overhead even with JIT. Eg: boundary checking, garbage collection."

        It depends whether or not those ease-of-use/reliability improving features outweigh the performance hit having them causes. For some the performance hit will be too great, for others it will be acceptable and outweighed by the amount of time they would otherwise spend hunting down an intermittent bug that causes array accesses to go haywire.

        ""IOW any time spent on "tricky" coding up front is probably in the wrong place"

        Sometimes, not always. Obviously if a program is I/O bound then no amount of fancy coding is going to significantly speed it up, but if it's CPU bound then you can work with the compiler and profiler to tightly optimise the relevant part of the code."

        I was actually talking about people who alter their coding plan based on what they think will be slow, without profiling their code first.

        The experience of developers from DE Knuth to Steve McConnell is that the bits they thought would be slow, when they did profile them, turned out not to be. The bottleneck was never where they expected it to be. It was somewhere else.

        What you're talking about is re-coding after you've found the hot spots with a profiler, which is best practice. Implement as simply as possible (to make sure the results are correct) to begin with and then optimize those parts that will make a serious difference, the proverbial 80/20 rule (or in some of Knuth's work the 95/5 rule, IE most of the run time was swallowed by just 5% of the code).

        BTW there are many places for a program to have bottlenecks. Looking at various programs, it seems the biggest speed-up comes from stepping back and deciding if the basic algorithm is right for the job. Changing that seems to have the biggest influence, but only after you've profiled the basic version.

        1. Alan Brown Silver badge

          Re: @Boltar

          "Obviously if a program is I/O bound then no amount of fancy coding is going to significantly speed it up"

          If the I/O bound is caused by trying to work on too much data in the first pass - most of which you're then going to throw away - then fancy coding (as in changing the order you work on things) makes a huge difference.

          This is one of the primary reasons for optimizing database joins and selects. Get it right and you can see speedups of 100x or more.
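The "change the order you work on things" idea can be sketched in Python with two toy tables. Everything here is hypothetical (the `orders` and `customers` lists, the field names, and both functions); real database engines make this kind of choice inside their query planner. The first version joins everything and then throws most of it away; the second filters first and joins only what survives.

```python
def join_then_filter(orders, customers, country):
    """Nested-loop join over everything, then discard unwanted rows."""
    joined = [
        (o, c)
        for o in orders
        for c in customers
        if o["customer_id"] == c["id"]
    ]
    return [(o, c) for o, c in joined if c["country"] == country]


def filter_then_join(orders, customers, country):
    """Filter customers first, index them by id, then probe once per order."""
    wanted = {c["id"]: c for c in customers if c["country"] == country}
    return [
        (o, wanted[o["customer_id"]])
        for o in orders
        if o["customer_id"] in wanted
    ]


# Hypothetical sample data.
customers = [
    {"id": 1, "country": "UK"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "UK"},
    {"id": 4, "country": "FR"},
]
orders = [
    {"order_id": 100, "customer_id": 2},
    {"order_id": 101, "customer_id": 1},
    {"order_id": 102, "customer_id": 4},
    {"order_id": 103, "customer_id": 3},
    {"order_id": 104, "customer_id": 1},
]

slow_result = join_then_filter(orders, customers, "UK")
fast_result = filter_then_join(orders, customers, "UK")
```

Both return the same rows, but the second never materialises the rows it is about to discard and replaces the inner scan with a dictionary lookup.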

  4. Anonymous Coward

    I think the author will find the issue has been Intel playing 'Happy Monopolies' all along.

    AMD's recent news shows us a great deal of bandwidth is on the way.

    1. TechnicalBen Silver badge

      Physical limitations vs economical.

      Yes to an extent. There are physical/mathematical limitations and tradeoffs. However there should be a little more room before we hit the physical requirement of going optical or quantum or magnetic in processor substance.

      Say when your entire mobile is "on a chip", so you don't even have all those PCBs in it, and it's the size of the Apple Watch or smaller. I assume that could be done today, but you'd have to pay for the slice of silicon and for the fact that it's custom to one device/design only (no scavenging the chips for other devices). It's expense that stops us doing 99% of the work on a single chip (and in a similar way stops us all going 100% SSD etc).

      Being able to use generic RAM/memory chips, generic power regulation etc is just more economical right now than doing a separate chip for every single compute design imaginable.

  5. Alistair Silver badge
    Holmes

    Ummm boys and girls

    "The real problem, as Hilgeman points out, is memory bandwidth per core."

    ... Something I've been saying since the CoreDuo......

    1. John Smith 19 Gold badge

      ""The real problem, as Hilgeman points out, is memory bandwidth per core.""

      Well sort of.

      In principle you hook each processor up to its own block of RAM and problem solved.

      But IRL most significant problems have to share some data (at some point) between processors. I don't think even SIMD systems are immune to this. IOW It's all about "contention" and how you deal with it. Essentially it comes down to 2 options.

      a) Single copy of data item. Every processor that wants it forms a queue.

      Fine if they are all reading it but if they are reading and writing then the final result could depend on what order the processors are accessing the data.

      So maybe you lock out all further writes until all processors (who are not writing this location) report they have now processed the new datum, at which point the next write happens.

      I have no idea how to make this happen and the datum could be a single word up to a whole data structure.

      b) Multiple copies IE each processor has a cache.

      But how do you efficiently inform all the other processors (who may have mapped different parts of the main memory to their caches) which part of main memory you've updated and that they should update their copies as well?

      The Transputer architecture still looks pretty good at handling these problems. Separate memory spaces, good mix of stack and local register architecture, hardware scheduling with a 2 level system, internal DMA with multiple channels (and in principle the ability to virtualize the channels).

      Too bad it never got a decent MMU.
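Option (a) above, where the final result depends on the order in which processors touch the shared datum, can be shown with a small deterministic simulation. This is a toy Python model, not a description of any real hardware: two hypothetical processors "A" (adds 1) and "B" (doubles) each do an unlocked read followed by a write, and the code enumerates every legal interleaving.

```python
from itertools import permutations

# Hypothetical per-processor update applied to the value it last read.
UPDATES = {
    "A": lambda x: x + 1,
    "B": lambda x: x * 2,
}


def run_schedule(schedule, initial):
    """Replay one interleaving of read/write steps on a single shared datum."""
    shared = initial
    local = {}
    for cpu, step in schedule:
        if step == "read":
            local[cpu] = shared            # private copy of the datum
        else:
            shared = UPDATES[cpu](local[cpu])  # write back, possibly stale
    return shared


def possible_outcomes(initial):
    """Final values over all interleavings where each CPU reads before it writes."""
    steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    outcomes = set()
    for order in permutations(steps):
        a_ok = order.index(("A", "read")) < order.index(("A", "write"))
        b_ok = order.index(("B", "read")) < order.index(("B", "write"))
        if a_ok and b_ok:
            outcomes.add(run_schedule(order, initial))
    return outcomes


outcomes = possible_outcomes(initial=1)
```

Starting from 1, the run ends at 4 (A then B), 3 (B then A), or 2 (both read the old value and one update is lost), which is why some form of locking or ordering is needed once writers are involved.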

  6. This post has been deleted by its author

  7. Pat Harkin

    I think I understand...

    ...we have to compute smarter, not harder.

  8. Alan Brown Silver badge

    The issue isn't just bandwidth

    Memory latency has barely changed in the last 20 years. When a processor sends out for data and can spend thousands/millions of cycles waiting for it to actually arrive, there's an obvious area for improvement if this can be solved.

    (DDRn means you can get more words in per request, but the ~60ns latency for random requests hasn't changed. There are lower latencies if you can get words from an adjacent row or in sequential order, etc but REAL chances of this happening in a multiuser system are slim to negligible.)

    Yes, profiling and the 95/5 rule still apply, but this is one of those problems that solving would result in across the board improvements in performance - and it could also result in simpler processors. A large chunk of the support logic (and power consumption) inside a modern CPU core is dedicated to trying to predict what addresses will be asked for next and having the data ready before the ALU asks for it. This kind of predictive prefetching and pipelining isn't terribly successful at predictions (usually about 30% at best) but it's still better than not trying at all, although longer pipelines looking further ahead aren't the answer. NetBurst proved that.
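To put the ~60ns figure above into cycles, here is a small worked calculation in Python. The 3 GHz clock and the 1000-access chain are assumptions chosen for illustration; neither appears in the thread.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Cycles a core would sit idle while waiting latency_ns at clock_ghz.

    1 GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    """
    return latency_ns * clock_ghz


def chained_stall_cycles(misses, latency_ns, clock_ghz):
    """Total stall when each access depends on the previous one
    (for example walking a linked list), so the waits cannot overlap."""
    return misses * stall_cycles(latency_ns, clock_ghz)


DRAM_LATENCY_NS = 60.0    # random-access latency quoted in the comment
ASSUMED_CLOCK_GHZ = 3.0   # assumption, not from the thread

single_miss = stall_cycles(DRAM_LATENCY_NS, ASSUMED_CLOCK_GHZ)
list_walk = chained_stall_cycles(1000, DRAM_LATENCY_NS, ASSUMED_CLOCK_GHZ)
```

Under these assumptions one random access costs about 180 cycles, and a chain of 1000 dependent misses costs about 180,000 cycles with nothing useful done in between.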

  9. John Smith 19 Gold badge
    Unhappy

    "one of those problems that solving would result in across the board improvements"

    Agreed.

    There seems to be a deep disconnect between the algorithm writers' view and that of the hardware people. DRAM's problems have not substantially changed. You take a big hit when you cross row boundaries and data alignment is difficult to control. Likewise once you've got that row being output you find it's not that fast (that said, Samsung said their latest are toggling at 9GHz, which does seem fast).

    For actual parallel algorithms what I think is needed is something like the hardware equivalent of the "Publish & Subscribe" model, but I'm not sure if it should be in terms of the data or the program PoV.

    Somehow a program indicates "I want to know if this data item (which is not in my local address space) has changed so I can use it." Likewise the program needs a mechanism so that when it writes a new value of a data item that others have requested they (and only they) get a copy of the new value. How do you retro-fit that to the shedload of legacy code out there?

    I'm sure this has been proposed repeatedly but I've never seen it done because it's a monumental PITA to implement fast enough in hardware, when a "data item" could literally be anything from the smallest addressable unit of memory up to a very large record (you'd want some way to say "I want to know when a matrix element changes, not the whole matrix". That's a given.)
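In software terms, the publish-and-subscribe idea described above might look like the following Python sketch. It is only a model of the behaviour, not a proposal for how hardware would do it, and the `SharedStore` class and its methods are invented for this example. Keys can be as fine-grained as a single matrix element, and a write is pushed only to the parties that asked for that key.

```python
from collections import defaultdict


class SharedStore:
    """Toy shared data store with per-item change notifications."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, key, callback):
        """Register interest in one data item (e.g. one matrix element)."""
        self._subscribers[key].append(callback)

    def write(self, key, value):
        """Store the value and notify only the subscribers of this key."""
        self._data[key] = value
        for callback in self._subscribers.get(key, []):
            callback(key, value)

    def read(self, key):
        return self._data[key]


store = SharedStore()

# Two hypothetical "processors", each with its own local copy of remote data.
local_a = {}
local_b = {}
store.subscribe(("m", 0, 1), local_a.__setitem__)  # A wants element (0, 1)
store.subscribe(("m", 1, 1), local_b.__setitem__)  # B wants element (1, 1)

store.write(("m", 0, 1), 7.5)   # only A is told
store.write(("m", 2, 2), 1.0)   # nobody subscribed, nobody is told
```

After these writes `local_a` holds the new value of element (0, 1) and `local_b` is still empty. The hard part raised in the comment, doing this bookkeeping at hardware speed for arbitrary item sizes, is exactly what the sketch leaves out.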

    Maybe a "Harvard" architecture with separate data and program spaces? A large shared "smart" (how many ways is that word overloaded?) data store? So every write is a data write and it's a question of figuring out which processors need to be told about it.

    The problem remains. You want a single big block of memory you can hand out to whatever processor needs it, however much it needs (so you can accommodate that huge model, even if the code to process it, running on your army of processors, is quite small) but you don't want all the delays you'll get with contention.

    I don't really believe you can have that total flexibility with multiple processors and have maximum performance. Directly linking a chunk of memory to a processor limits your maximum code size but guarantees no contention. I believe you can have better, but you can't have it all.


Biting the hand that feeds IT © 1998–2020