Now that's a right bummer...
It would have been cool to see the Tianhe-1A taken down a notch or two...
IBM has pulled the plug on the "Blue Waters" petaflops-class, Power7-based supercomputer that it was contracted to build for the National Center for Supercomputing Applications at the University of Illinois. In a statement released today by IBM and NCSA, the two parties said that Big Blue terminated the Blue Waters contract …
A K-clone super maybe? That one's already been done and delivered. I don't really understand the supers costing to see if that's feasible. Sure the parts won't be cheap but as long as it brings in more money than it costs it's doable.
Wonder what broke down at IBM though. "Financial decision by the IBM brass" doesn't really tell me much of anything. Something for a special then?
It sounds like IBM and others may have been doing projects like these as a loss leader to get interest from say the gas and oil and financials because look we built the fastest computer in the world currently. With the current recession it was not worth losing money to show how big your dick is I guess.
"The University of Illinois and NCSA selected IBM in 2007 as the supercomputer vendor for the Blue Waters project based on projections of future technology development,"
That's where the rub was. They expected, based on planned product lines, that it would cost X to build the machine in 3 years' time. Only 3 years and an economic meltdown later, it now costs X+Y. The difference is too large for either body to be willing to absorb it.
"During the Great Recession, both companies were losing money"
I like how this somehow implies that the "Great Recession" is in the past, instead moving from worse to worserest.
If it's too expensive, and the future is uncertain, to stop it is the correct decision. One can come back later, when it is again within technological and economic reach. The building can be used to store mugs and beer kegs.
Keynesians looking for spending targets may disagree, but then again, they would smash the finished machine to bits just to have good reason to buy another one.
Why don't they talk to Fujitsu and use the SPARC super computer? It is much cheaper than the POWER7 super computer, and it has higher performance. And the research has already been done, and it is ready to be delivered. There is no need for another 5 years of R&D.
I wonder if IBM could build a super computer with Intel Xeon Westmere-EX cpus instead? They are only ~10% slower than POWER7, but much cheaper. Or wait until next year, when the Ivy Bridge version arrives, which will be 40% faster than Westmere-EX. I mean, aren't there many supercomputers today that use Intel Xeons? They are fast and much cheaper.
If you read the article, it appears to be the interconnect that is the problem. Power7 works just swell.
If you compare the interconnect on the K Computer, it does not even match the bandwidth of the Power6 based 575 systems that IBM shipped three years ago, which is a fat tree with a multi-path point-to-point bandwidth of 20GB/s, not the paltry 5GB/s that the K computer has (yes, these figures for the IBM 575 with 8 planes of DDR 4X Infiniband are correct, and that is Bytes not Bits).
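The 20GB/s figure can be sanity-checked with a back-of-envelope sum. This is only a sketch: it assumes the quoted number is the raw signalling rate (a DDR 4X Infiniband link is 4 lanes at 5 Gbit/s, with 8b/10b coding leaving 80% for payload) and that the 8 planes simply add up, which real multi-path traffic will not always achieve.

```python
# Back-of-envelope check of the "8 planes of DDR 4X Infiniband" figure.
# Assumption: the planes are simply summed; real multi-path throughput
# depends on routing and message size.

LANES_PER_LINK = 4          # the "4X" in DDR 4X
GBIT_PER_LANE_DDR = 5.0     # DDR signalling rate per lane, Gbit/s
ENCODING_EFFICIENCY = 0.8   # 8b/10b line coding
PLANES = 8


def aggregate_gbytes_per_s(planes: int, payload_only: bool) -> float:
    """Aggregate one-direction bandwidth over all planes, in GByte/s."""
    gbit = planes * LANES_PER_LINK * GBIT_PER_LANE_DDR
    if payload_only:
        gbit *= ENCODING_EFFICIENCY
    return gbit / 8.0  # bits -> bytes


raw = aggregate_gbytes_per_s(PLANES, payload_only=False)
payload = aggregate_gbytes_per_s(PLANES, payload_only=True)
print(f"raw signalling: {raw:.0f} GB/s, payload: {payload:.0f} GB/s")
```

On those assumptions the raw rate comes to 20 GB/s and the usable payload rate to 16 GB/s, so the quoted number is in Bytes either way.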
The Torrent in the P7 775 is targeted to provide 4-5 times the bandwidth. If you actually understand HPC type work, you will understand how important bandwidth is in MPP HPC systems. Most HPC systems working on multi-threaded modelling tasks spend significant amounts of time in spin-loops waiting for the communications system to synchronise between adjacent threads. As the amount of parallelization goes up, so does the need for communication.
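A toy model makes the point about spin-loops. All the numbers below are made up and the function names are purely illustrative: each time step a thread computes on its share of the problem, then waits for a boundary exchange whose cost depends on link bandwidth and latency rather than on how many threads there are.

```python
# Toy model of one time step of a tightly coupled job. All figures are
# hypothetical; this only illustrates the trend, not any real machine.

def step_time(total_work_s: float, threads: int, halo_bytes: float,
              bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Compute time shrinks with more threads; exchange time does not."""
    compute = total_work_s / threads
    exchange = latency_s + halo_bytes / bandwidth_bytes_per_s
    return compute + exchange


def waiting_fraction(total_work_s: float, threads: int, halo_bytes: float,
                     bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Share of each step spent waiting on the interconnect."""
    total = step_time(total_work_s, threads, halo_bytes,
                      bandwidth_bytes_per_s, latency_s)
    return 1.0 - (total_work_s / threads) / total


for bw_gb in (5, 20):
    for threads in (64, 1024, 16384):
        frac = waiting_fraction(100.0, threads, 50e6, bw_gb * 1e9, 5e-6)
        print(f"{bw_gb:>2} GB/s, {threads:>5} threads: "
              f"{frac:.0%} of the step spent waiting")
```

Even in this crude form, the waiting share climbs steeply as the thread count rises, and the faster link pushes that point further out.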
Unfortunately, the tests that are run to get the top500 ranking are not actually representative of real-world type problems, and as far as I know, the figures are scaled from partial runs. The site that I work at has never committed all of the resources of either of the main clusters to produce figures, yet they appear on the list!
If IBM can't even deliver a 1 petaflop HPC cluster based on Power7, how would it deliver the estimated peak performance of 16 petaflops?
You can say whatever you want about the Japanese K Computer's downsides on the interconnect, but Fujitsu did deliver it fairly within schedule and reached 10 petaflops on top500 benchmarks. IBM has been advertising the p775 model for some time and now it can't deliver a configuration with 1/16 of the capacity. I am sure the other clients IBM said are lined up for these machines are very worried right now about their orders...
The article says nothing about IBM not being able to deliver the cluster. It says they decided not to, because it was more expensive than they had forecast.
There already are 1+ PF Power-based clusters (Roadrunner, FZJ). There's no evidence of any obstacle to building a Power7-based one, other than cost.
But thanks for playing.
You and I have a contract. I pay you 1M you deliver me 2M widgets in 6 mo. If you cannot deliver that within the time-frame and budget, then you did not deliver.
Of course attitudes like yours are the reason no government project is anywhere near budget or on-time.
The Xeon E7 is up to 160% more performance and 50% of the cost of a T5440 and up to 600% of the performance at 50% of the cost of an M4000.
How many SPARC boxes are in the Top 500? I don't have the time to look but I bet single digit if not zero. (You can't count the Japanese VIIIfx cluster because that chip will never be a commercial offering as all future SPARC64 chips have been cancelled)
I think this article should be made compulsory reading in government departments.
You'll save much more face by admitting something's not working/going to plan and stopping early, rather than burying your head in the sand and hoping no-one will blame you when it all goes pear-shaped.
Ah, but sometimes the government overspend before the cancellation is not innocent. Not all civil servants are stupid, some of the ones I have worked with are very sharp, keep their ear very close to the ground, and are surprisingly good at predicting how political machinations will affect their projects. Others are just terrible at the same, but that's not to say all are. On many occasions I have seen behind-the-scenes horsetrading, where bits of projects that were known to be endangered were implemented to benefit other projects or to get other interested parties onside. For example, I worked on one project years ago where we were planning on some thumping big servers, which required a thumpingly fast new network. Our civil servant PM wanted the network as it would benefit several other projects coming down the pipeline, so he carefully ensured the implementation phases were changed to get the network kit bought and installed first (much like the Uni of Illinois "just happens" to have a nice new building). When the rest of the project was cancelled, he threw up his hands and feigned ignorance, then happily went on to use the new network for the next set of projects.
POWER7-based HPC, IMHO, is only justified for old algorithms that can't be properly distributed among nodes since they need to be run on a single core, so without even using shared memory. But anyway, for this kind of tasks, you don't need an interconnect, just buy a POWER 780 with the highest CPU clock available...
In all other cases it is hard for both POWER and SPARC to beat the price competitive x86 processors. The challenge of high throughput and low latency still remains, but with technologies like Infiniband all CPU architectures would share the same interconnect, thus returning to the price/performance/watts/cooling comparison!!
"But anyway, for this kind of tasks, you don't need an interconnect, just buy a POWER 780 with the highest CPU clock available.."
For what kind of tasks? Most HPC tasks require a balance of CPU, interconnect (for message passing), and I/O performance, so to say these codes just need a Power780 is pretty shortsighted. If you can get the fastest CPU then you can get away with a less efficient interconnect. If your code is I/O bound then it doesn't matter how fast your CPU or interconnect are, if they are constantly in I/O wait.
As regards x86 and shared memory, memory addressing on XEONs maxes out at 16TB, and for something the NCSA is buying, I would be surprised if that was big enough.
Unfortunately, too many people, even those working in the IT industry, don't understand the unique challenge that HPC systems pose.
It's perfectly possible to put a load of processing power together in a room and string it together with something like gigabit Ethernet, and claim that it is an HPC cluster, but that is no way of producing a system able to do the most demanding tasks.
For those who don't know the distinction, let me explain.
There are two main types of work in HPC. One which can be completely partitioned (think SETI or any of the community grid projects), and the other where each 'cell' requires an exchange of data with its neighbours for each loop iteration, effectively making a mesh of interconnected threads, all linked together.
The first type can be fully distributed to almost autonomous systems with a low bandwidth interconnect used to distribute the data and gather the results. The Linpack benchmark used for the top500 is an example of this type.
The second needs some terrific interconnect, and the bigger the model and the finer the granularity (and the greater the number of cores), the greater the interconnect bandwidth, and the more important the topology of the interconnect becomes.
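The two shapes of workload can be shown in miniature. This is a pure Python sketch with no real message passing: the "ranks" are just list slices, and the function names are invented for illustration.

```python
# Two workload shapes in miniature (pure Python, no real message passing).
# The "ranks" here are just list slices; names are illustrative only.

def independent_work(chunks):
    """Type 1: each chunk is processed without looking at any other."""
    return [sum(x * x for x in chunk) for chunk in chunks]


def stencil_step(chunks):
    """Type 2: every cell needs its neighbours, so each chunk must first
    fetch one boundary ("halo") value from the chunks on either side."""
    new_chunks = []
    for i, chunk in enumerate(chunks):
        # At the outer edges of the grid, reuse the chunk's own edge value.
        left = chunks[i - 1][-1] if i > 0 else chunk[0]
        right = chunks[i + 1][0] if i < len(chunks) - 1 else chunk[-1]
        padded = [left] + chunk + [right]
        new_chunks.append([
            (padded[j - 1] + padded[j] + padded[j + 1]) / 3.0
            for j in range(1, len(padded) - 1)
        ])
    return new_chunks


grid = [[0.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]]
print(independent_work(grid))  # no communication at all
print(stencil_step(grid))      # the 9.0 leaks into the first chunk
```

In the second function the `left` and `right` lookups stand in for the neighbour exchange. On a real cluster each of those is a message over the interconnect, once per chunk per iteration.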
Where things have been going wrong is that the power of an individual core has been pushed about as far as current technology can go. Power6 was a beast of a processor. But it is debatable whether a single Power7 core is faster than a Power 6, and like Intel and Fujitsu, IBM has switched from faster to more cores. The problem here is that to do the same amount of work, you have to use more processors, and then have to make the code more parallel. Many real world problems do not scale well to many processors. Often, re-writing them from scratch for models that have been developed over decades is effectively impossible. The time necessary to re-write something like the Unified Model (weather model used by many of the world's weather bureaus) will span two or three generations of supercomputer, so you can never know what you are writing for!
And more cores mean more load on the interconnect and a higher bandwidth requirement.
The other thing that Blue Waters was to do, as an important part of the PERCS initiative, was to increase the processor density, increase the energy efficiency, and decrease the cooling requirement for computing power.
When comparing a Power 6 HPC based on P6 575s (which was already quite power and space efficient) with a planned Power 7 HPC based on P7 775s (these are NOT figures from Blue Waters BTW), for a system with 3-4 times the computing power, the Power 7 system was going to use the same electrical power, have about 2/3 of the cooling load and occupy approximately 1/3 of the floor space. I don't know how that compares with the K Computer.
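Turning those rough ratios into relative efficiency figures is simple arithmetic. The sketch below takes the Power 6 system as 1.0 on every axis; the inputs are the approximate ratios quoted above, not measurements.

```python
# Relative efficiency implied by the comparison above (P6 575 = 1.0).
# Inputs are the rough ratios quoted in the text, not measured data.

def relative_gains(compute: float, power: float,
                   cooling: float, floor: float) -> dict:
    """Compute delivered per unit of power, cooling load and floor area."""
    return {
        "compute per watt": compute / power,
        "compute per unit cooling": compute / cooling,
        "compute per floor area": compute / floor,
    }


for compute in (3.0, 4.0):
    gains = relative_gains(compute, power=1.0, cooling=2 / 3, floor=1 / 3)
    print(f"{compute:.0f}x compute:",
          {name: round(value, 1) for name, value in gains.items()})
```

On those figures the Power 7 system would deliver roughly 3-4 times the compute per watt, 4.5-6 times per unit of cooling, and 9-12 times per unit of floor space.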
IBM set themselves a tall order, and have had difficulty in delivering. But much of the work to make this happen will not be wasted and that was part of what PERCS was all about. The techniques will probably appear in various forms on future x86_64 and Blue Gene systems. We are already seeing water cooling proposed on Blue Gene Q and IDataPlex systems. But it does mean that Power 8, when it happens, will probably be targeted at commercial workloads rather than HPC.
It will be interesting to see whether IBM will now switch their main effort to making Blue Gene their primary HPC offering.
"....But it is debatable whether a single Power7 core is faster than a Power 6, and like Intel and Fujitsu, IBM has switched from faster to more cores. The problem here is that to do the same amount of work, you have to use more processors, and then have to make the code more parallel...."
Yes, I agree that "IBM has switched from faster to more cores". Of course these cores must be lower clocked for the cpu to stay in the same thermal envelope. It would not work to have higher clocked cores, and many of them. The only way to have higher clocked cpus is to decrease the number of cores.
But most cpu vendors agree that multi core cpus are the way forward, to increase the number of cores. Not, decrease the number of cores and clock them at 7GHz or higher. Highly clocked single core cpus are a dead end. I think everyone understands this, including IBM. Maybe some persons don't agree on this, though.
Regarding Fujitsu's K super computer power usage, I have read somewhere that it has a good performance/power ratio. The SPARC cpus are using 58 Watt which must be considered quite low, yes?
Regarding the interconnect problem that the IBM super computer has, yes, that has always been the problem. The more nodes you connect, the more difficult it will get. I am not surprised that IBM has scaling problem. Everyone has scaling problems, not only IBM.
I think that Tilera with their 100 core cpus is interesting. Tilera has focused on fast interconnect between many weaker cores, and has not focused on fast core performance. Thus, the core is weak, but Tilera achieves very high performance. Maybe it is viable to have many weak cores? Maybe focus should be switched from fast cores, to good interconnect between cores? Just as Tilera has done.
Maybe IBM has not focused on good interconnect, because it seems that IBM's super computer has scaling problems when they add more and more nodes. So, is it better to have fast cores with bad interconnection, or weaker cores but good interconnection? If you want to scale, weak cores are not the problem.
Quite a good explanation of the differences of workloads in HPC.
But with regards to POWER7 and the POWER 775 versus the 575 I think you are wrong. Sure POWER6 is faster clocked than POWER7 is.
But judging from the benchmarks released, POWER7 is at least as fast on a per thread level as POWER6, and most likely a good bit faster. But it's not huge, and if you have badly compiled code that runs in GHz then sure :)=
Now comparing the POWER 575 with its newer version the 775 then it's x8 the memory and Processor cores.. so I guess your numbers aren't really that accurate.
I don't know where you work, but I am involved in a Power 7 775 rollout. Yes, individually, each drawer of 775 has much more memory and CPU, but each drawer is being divided up into 8 system images (called rather misleadingly 'nodes'), one for each QCM and Torrent chip pair. I believe that this is because of the number of CPUs AIX can schedule with SMT turned on, but I cannot confirm this.
Each of these system images is being delivered with the same amount of RAM (not the full complement) and 32 cores, exactly the same as our 575s, but instead of 12 system images per frame, we are getting 3x4x8 or 96 system images, which is actually 8 times as many system images per frame, although the frames are larger. In addition, the persistent storage is laid out completely differently and much more densely.
Because the extra power in the 775 cluster is being divided into more system images of broadly similar size, the way that the problems are decomposed becomes more important, and I am told that there are problems with the code here as currently written which limits the benefit of using more systems for single jobs. The new cluster will definitely be able to do more work, but the benefit to individual jobs is a little less obvious, and really revolves around how much better the HFI is over Infiniband.
We have some preliminary figures for indicative benchmarks on 775, and although I am not totally up to date, they are showing mixed results, although broadly positive. It's complex, because CPU speed is lower and whilst out-of-order execution may make a difference (although the IBM compilers with Power6 optimization turned on make a damn good job of organising instruction sequence, especially when the application is carefully crafted to match the available execution units), the interconnect is faster but has less determinate latency, and the amount of available RAM may be slightly less (complexity here, we may not get a full complement of memory). Hopefully, as the Power 7 compilers further mature, the code should run faster.
I see the Power7 advantages as being mainly for commercial applications, although the 64 bit vector unit may make a bit of a difference. In my view SMT and out-of-order should work best with threads of different executables, rather than several instances of the same thread lock-stepped together where you will get multiple simultaneous demands for particular instruction units. Also, as many of the processes we run are actually memory bound, we often artificially restrict the number of threads to effectively disable SMT (yes, we can turn it on and off dynamically, but for some reason, it screws up the logical-to-physical CPU mapping on 575s). 4 way SMT is unlikely to provide a huge benefit without more memory, even with the increased number of execution units.
The Linpack benchmark, unlike our workload, runs multiple uncoordinated threads, so will benefit from out-of-order execution, so the figures in the brochures may not reflect real HPC problems.
I am involved with this in-flight project, so I must post this anonymously, but we have broadly agreed in El Reg comments before.
First, I must admit that my knowledge of the POWER 775 is kind of like on the car magazine level. I haven't seen any material on any IBM sites about the box. There haven't been written any redbooks, there aren't any manuals online yet. So....
But I am a bit puzzled by what you write about the machine being split up into 8 images. I know that if you order the machine with preloaded images, that is what you'll get from eConfig by default if you don't specify something else. But if you specify one image, eConfig doesn't protest. You can verify pretty quickly whether one image can span a whole machine by simply having a look at the HMC and seeing what it'll let you do.
Now AIX scales to 1024 threads quite nicely, we got our first POWER 795 here some time ago.. and it got test booted up with one virtual machine with 1024 virtual cores. Wroom Wroom.
With regards to QCM's, you do mean DCM right ?
Now with regards to tuning code. All I've done is to read this paper:
mostly out of curiosity. I haven't done HPC work for... like 6-7 years since I was a consultant. Today being an architect it's all about spreadsheets and power point and explaining to managers how things really work in the real world.
With regards to turning SMT on and off.. yeah, half or 3/4 of your processors will disappear, so you'll have to fix any scripts etc. that assume a continuous list of processors, so no for x in `lsdev -C....` but hey... :)=
But to use SMT or not to use SMT on HPC computing depends on what you are doing.. if you are only doing FP and that is what you need to do as fast as possible, then it's hard to get too much benefit :)=
I have never been told the reasoning behind the 8 OS images per drawer, just been told how the design and layout of these is being delivered from Poughkeepsie. IBM have a very strict idea of the way that these systems are deployed, and will normally do the initial installation themselves before handing them over to the customer as a working cluster.
I have not yet seen one of these systems to log in to, because I am not in the IBM 'hit squad' that they are dropping in to deploy the systems, so have not been given any early access, and our deliveries of real systems are scheduled for later in the year. I have had access to some architectural details, although not as much as I would have liked.
It is quite clear that the layout of the memory is organised around the QCM, so it may make sense to define a system as the QCM, the memory clustered around it and the associated Torrent chip, to reduce the load on the L-Links between the Torrent chips for cache coherency.
QCM's. They are Quad Chip Modules in the 775, i.e. 4 Power 7 chips on a single module, giving you 32 cores per module. There are 8 QCMs per drawer, giving you 256 processors in a single drawer, and 1024 in a Super Node. All of the CPUs on a single QCM are on processor busses contained on the QCM itself, and all communication to CPUs on other QCMs has to go via the Torrent chip, as does access to memory attached to another QCM. This will give different access times to memory depending on which QCM it is attached to.
In the photos that came out of Supercomputer '10, where you see 16 large chips with water-cooled heat sinks, these are 8 CPU modules and 8 Torrent chips, not 16 CPU modules.
When I look at the number of CPUs that can be scheduled by AIX, if the number has been raised to 1024, with 4-way SMT, then you could run a complete drawer as a single system, although not a Supernode. I was initially confused, because I was under the mistaken impression that the Power 7 chip would do 8-way SMT, not the 4 way it actually does, so some comments I have made in the past were plain wrong, and the Register does not allow comments to be edited, only withdrawn. I apologise for that.
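The counts above can be laid out as a quick sum. A sketch only: the 8 cores per chip follows from 32 cores on a 4-chip module, and the 4 drawers per Supernode is inferred from 1024 / 256 rather than taken from any IBM document. The 1024 limit is the AIX figure mentioned earlier in the thread.

```python
# Core and logical-CPU counts for the 775 as described above.
# Drawers per Supernode is inferred from 1024 / 256; treat as an assumption.

CORES_PER_CHIP = 8            # 32 cores per QCM / 4 chips
CHIPS_PER_QCM = 4
QCMS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4     # inferred, not stated in the text
AIX_MAX_LOGICAL_CPUS = 1024   # figure quoted earlier in the thread

cores_per_qcm = CORES_PER_CHIP * CHIPS_PER_QCM
cores_per_drawer = cores_per_qcm * QCMS_PER_DRAWER
cores_per_supernode = cores_per_drawer * DRAWERS_PER_SUPERNODE


def fits_single_image(cores: int, smt_ways: int,
                      max_logical_cpus: int = AIX_MAX_LOGICAL_CPUS) -> bool:
    """True if every hardware thread gets its own logical CPU in one image."""
    return cores * smt_ways <= max_logical_cpus


print("drawer, 4-way SMT:   ", fits_single_image(cores_per_drawer, 4))
print("drawer, 8-way SMT:   ", fits_single_image(cores_per_drawer, 8))
print("supernode, 4-way SMT:", fits_single_image(cores_per_supernode, 4))
```

With 4-way SMT a drawer lands exactly on the 1024 limit, while the mistaken 8-way figure would have overshot it, which is where the earlier confusion came from.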
That paper you have quoted is for 755s. These are smaller systems, and the memory layout and system interconnect is very different, and they rely on 2 planes of Infiniband (in the potted configuration that IBM sells as cheap[er] HPC systems) as the system interconnect. The 775s probably will be clocked at a different speed, and the HFI interconnect is very different, being effectively an all-to-all mesh, although the details of this are not clear in the presented material.
I am not an HPC application writer. I merely support the systems. But I do often have lunch with the people involved in tuning the code. Some of them have already had access to 755s and a small amount of time on 775s. They say that the code they write runs very differently on the 755s, and cannot be used to get a representative figure for the code on 775s. But the type of code they are running spans many machines, and very little of their code will run on a single 755 system, so comparisons are not valid. Certainly, I agree that a single Power 7 core is very potent.
The L-Links and D-Links that tie a Supernode together do not appear to work like the Flex-Bus on Power 5 and Power 6 570s. Each drawer in a Supernode has its own service processor, and its own Ethernet connections to the HMC (yes, they use HMCs not SDMCs), and as far as I can tell, is not able to use any I/O resource in another drawer (I could be wrong on this point). This makes me think that even if AIX were to be able to schedule more CPUs, then they would not be able to scale beyond a single drawer anyway.
The comments I made about SMT were because when running using LoadLeveler on Power 6 systems, you can turn SMT on and off using directives in the LL job file. Normally, with SMT turned off, you only get even numbered CPUs numbered 0, 2, 4, 6...62. With SMT turned on, you get 0-63 as expected, with logical CPUs 0 and 1 on the same physical CPU. But if you start with SMT turned on, and turn it off and then back on, you lose this neat mapping, and can find that logical CPUs 0 and 1 may appear on completely different physical CPUs, and we have seen strange situations where with SMT turned off, we no longer get even-only CPUs configured.
Unfortunately, although AIX appears to make a good stab at distributing threads evenly across physical CPUs when this logical to physical mapping is regular, as soon as it is upset, you can get two threads on Logical CPUs on the same physical CPU even if other CPUs are completely idle. This has led to some very strange and non deterministic job times! And the only way to fix it that we found was to re-IPL the system. Clearly not something we want to do on a busy cluster.
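To make the mapping problem concrete, here is a simplified model of a 32-core node with 2-way SMT. It is not how AIX does its bookkeeping, and the "scrambled" layout is a made-up example of what a disturbed mapping could look like.

```python
# Logical-to-physical CPU mapping on a 32-core node with 2-way SMT,
# as described above. A simplified model, not how AIX actually does it.

def regular_mapping(cores: int, smt_on: bool) -> dict:
    """Map logical CPU number -> physical core for the tidy layout.
    With SMT off only the even-numbered logical CPUs exist."""
    step = 1 if smt_on else 2
    return {lcpu: lcpu // 2 for lcpu in range(0, cores * 2, step)}


def shared_cores(mapping: dict, busy_lcpus: list) -> set:
    """Physical cores that end up running more than one busy thread."""
    seen, clashes = set(), set()
    for lcpu in busy_lcpus:
        core = mapping[lcpu]
        if core in seen:
            clashes.add(core)
        seen.add(core)
    return clashes


tidy = regular_mapping(32, smt_on=True)
# Spreading threads over even logical CPUs avoids sharing a core...
print(shared_cores(tidy, [0, 2, 4, 6]))        # set()

# ...but only while the mapping is regular. In this hypothetical
# scrambled layout, logical CPUs 0 and 2 sit on the same core.
scrambled = dict(tidy)
scrambled[2] = 0
print(shared_cores(scrambled, [0, 2, 4, 6]))   # {0}
```

Any placement rule that relies on logical CPU numbers alone gives the first result on a tidy system and the second once the mapping has been disturbed, with another core sitting idle.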
As a result, we keep SMT turned on all the time, and control the number of threads using MPI or OpenMP directives in the code, rather than using LL.
One problem is that when using the optimised maths libraries, some of the routines will be multi-threaded themselves, so even if we attempt to control the number of running threads, sometimes we will get a flurry of threads, and unfortunately, the way the code is written lock-stepped, all of our threads will hit similar points at the same time, and the system can be overloaded with more threads than it has logical CPUs, causing context switches. Very undesirable.
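A rough worst-case check for this kind of oversubscription might look like the sketch below. The thread counts are illustrative, and the model assumes every application thread enters a threaded library routine at the same moment, which lock-stepped code makes quite likely.

```python
# Rough check for thread oversubscription when a threaded maths library
# is called from already-threaded code. Numbers are illustrative.

def peak_threads(app_threads: int, lib_threads_per_call: int) -> int:
    """Worst case: every application thread is inside the library at once,
    each running with its own team of library threads."""
    return app_threads * lib_threads_per_call


def oversubscribed(app_threads: int, lib_threads_per_call: int,
                   logical_cpus: int) -> bool:
    """True if the worst case needs more threads than logical CPUs."""
    return peak_threads(app_threads, lib_threads_per_call) > logical_cpus


# 32 application threads on a node with 64 logical CPUs:
print(oversubscribed(32, 1, 64))  # library stays serial
print(oversubscribed(32, 4, 64))  # library spawns 4 threads per call
```

With a serial library the 32 threads fit comfortably; with 4 library threads per call the worst case is 128 runnable threads on 64 logical CPUs, and the context switching starts.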