16TFlops for £97m???
I know that government procurement is notoriously bad at getting value for money, but surely not that bad?
The UK's Met Office has settled on Cray as the vendor for its next supercomputer, with a 16 petaflop XC40 machine* to be shared between the Met Office and Exeter Science Park. The £97m HPC box will give the Met Office 13 times more supercomputing muscle than it is currently able to flex. The weathermen want to improve the …
My boss' words on the subject were "they can have my PC for a million if they want"
Whilst it's not only about the flops, at face value it does seem like they're overpaying by about £96 million to avoid a £500k software problem.
But it's only taxpayer money so I'm sure it's fine.
Put it this way - for just over twice the price you can buy the world's most powerful supercomputer (Tianhe-2) at 55 PFLOPS (theoretical peak) - before the downvotes start.
"No, they mean TFLOPS. 16PFLOPS is world top 4 territory and not something the Met Office are buying."
I'm pretty sure they aren't getting ripped off that much. The Cray Titan at Oak Ridge is ~20PF and that was $100m in 2012. It's got to be Peta rather than Tera in this case.
Yeah I actually changed my mind about this having read some more - their old kit is 140TFLOPS sooo.. In which case that's actually some serious business and everybody looks less silly.
Be nice if they could put out an accurate press release though, given everybody is saying 16TFLOPS, which is majorly different.
Still looks expensive compared to Tianhe-2 though :)
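For anyone who wants to sanity-check the price comparisons above, here's a rough sketch using only the figures quoted in this thread; the Tianhe-2 price is taken from the "just over twice the price" remark, so treat it as an assumption rather than a real list price.

```python
# Back-of-envelope GBP-per-PFLOP comparison using only figures quoted in this
# thread (prices, peak flops and exchange rates are all fuzzy - illustration only).

met_office_cost_gbp = 97e6        # quoted contract value
met_office_pflops = 16            # quoted peak

tianhe2_pflops = 55               # theoretical peak, as quoted above
tianhe2_cost_gbp = 2 * met_office_cost_gbp   # "just over twice the price", per the comment

for name, cost, pf in [("Met Office XC40", met_office_cost_gbp, met_office_pflops),
                       ("Tianhe-2 (approx.)", tianhe2_cost_gbp, tianhe2_pflops)]:
    print(f"{name}: ~£{cost / pf / 1e6:.1f}m per peak PFLOP")

# Prints roughly £6.1m/PFLOP vs £3.5m/PFLOP - but remember the £97m figure
# covers far more than the hardware (see the discussion below).
```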
@Chicken Marengo - Please don't resort to hack journalist reactionary outrage.
The figure £97M is very likely the approved amount for the whole capability. That includes a lot more than the purchase price of the computer and will likely (should) cover all costs that the computer needs for its expected life. But basically there are many questions that need to be answered before you can estimate how good a value has been achieved.
First unanswered question - how long does the approval cover? The amount is meaningless if you don't understand that the approved figure likely covers the whole cost of the asset over its life, or at least a significant period. How much will a *fully supported* supercomputer cost over, say, 10 years? I guess you have no better idea than I do.
Then add on all the other costs. How much will it cost to build the new building (it's going in a new building), and how much will the energy bills amount to? What about training, and any software development costs to migrate to the new system? Don't forget your project risk budget - things will go wrong that cost extra money, particularly where new buildings are involved - and then factor in the cost of doing things properly; they like to do sustainable developments (i.e. not necessarily cheap upfront costs).
All these things mean that the cost sounds expensive if you think they are only talking about buying a collection of servers, but the quoted figure covers so much more - enough to make it meaningless to anyone who hasn't seen the full budget.
Summary - it may be an overpriced system, it may not. We simply don't have enough to go on.
>>@Chicken Marengo - Please don't resort to hack journalist reactionary outrage.
Dear Don,
Please accept my humble apologies for offending your delicate sensibilities with my excessively Daily Mail tone. Next time I shall use the joke alert icon so you are suitably forewarned not to take anything I say seriously.
I do however suggest that you actually bother to read and understand what I wrote before jumping to your own moral outrage at my reactionary outrage, after all it was only one sentence so not exactly a taxing exercise.
To me, my subject line '16TFlops for £97m???' clearly suggests that not a lot of compute power is being acquired for that sum, rather than the sum itself being excessive.
As support for my argument, I reference the cluster humming away at the other end of the office I'm currently sat in, with its pile of Tesla GPUs which delivers >16TFlops, and I'm pretty sure it didn't cost £97m since that is greater than our annual turnover.
@chicken Marengo - I do extend a sincere apology. I was aware you were being facetious but didn't realise it was about the order of magnitude. I guess I'm a little tuned in to the outrage over government costs from people who don't understand them. No need to label you a Daily Mail hack.
Although, I would just point out that "not a lot...being acquired for that sum" and "the sum itself being excessive" are basically the same comparison. One is saying that something is too expensive, the other that not much has been bought for the price.
> First unanswered question - how long does the approval cover?
> The amount is meaningless if you don't understand that the approved figure likely covers the whole cost of the asset over its life, or at least a significant period.
Actually what it costs is meaningless against what it does. If it saves lives on a regular basis it is priceless.
If it gives a reasonable approximation occasionally in forecasts about how foggy it is going to be...
Not so much.
How does it compare with this:
http://www.bom.gov.au/australia/charts/viewer/index.shtml?type=mslp-precip&tz=UTC&area=SH&model=G&chartSubmit=Refresh+View ?
If you knew how to set up your unbadged 486, you could run it on that (eventually). The data is readily available online. OK, you would probably need something more like my 2GB RAM, 32 bit ex-Vista e-machine to get a decent reproduction, but you could still be in time to forecast earthquakes and volcanic eruptions on it. More likely so, as you can run the data out to any time frame.
The problem is the flower-pickers at Climategate central are very unlikely to use it for anything so sensible. It must have been agony writing the press release telling us that we have finally lost the war. But on the bright side: now they have told us where the ceiling is, we no longer have to crash through it.
Don't get me wrong, it's always nice to have shiny! And this is something shiny indeed, but when the Met Office often struggle to get a forecast correct for an entire region for the next day ("no rain"... rain!), never mind getting down to 300m cells, I would get your house in order first and refine the current models before going into more detail.
I have to argue against this. The P7 775s (there are two main forecasting systems, it's not just a single supercomputer) are perfectly adequate for what they are currently running. In fact, they are not using as much of them for the operational forecast as they were expecting to at this point in their life. There is still plenty of headroom, and I believe that some in the Met Office would have liked to keep them a bit longer before replacing them. Switching vendor is a seriously long and difficult task, and the IBMs are pretty reliable most of the time.
The problem that they have is that there is a scaling issue with the Unified Model, which limits the amount of parallelisation that can be used. Scaling the UM up rapidly moves from being a compute intensive problem to a communication intensive problem, with the speed and latency of the interconnect becoming the constraining factor.
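To illustrate the compute-to-communication shift described here, a toy model helps: split a 2D grid across more and more ranks, and the local work shrinks with the subdomain area while the halo exchange only shrinks with its perimeter (plus a fixed latency hit per message). The constants below are invented for illustration and are not the UM's actual numbers.

```python
# Rough sketch of why a grid model shifts from compute-bound to communication-bound
# as you scale out. Assumes a simple 2D domain decomposition with halo exchange;
# the per-flop, per-byte and latency constants are made up for illustration.

import math

def step_time(n_cells_per_side, ranks, t_flop=1e-9, t_byte=1e-10, latency=2e-6):
    side = n_cells_per_side / math.sqrt(ranks)      # local subdomain edge
    compute = t_flop * side * side                  # work scales with area
    comms   = 4 * (latency + t_byte * side)         # halo exchange scales with perimeter
    return compute, comms

for ranks in (256, 4096, 65536):
    comp, comm = step_time(n_cells_per_side=10000, ranks=ranks)
    print(f"{ranks:6d} ranks: compute {comp*1e3:.3f} ms, comms {comm*1e6:.1f} us, "
          f"comms share {comm / (comp + comm):.0%}")
```

At low rank counts the communication share is a few percent; at very high rank counts it dominates, which is the scaling wall being described.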
When you look at where the two existing P7 775s are on the top 500, both systems are still in the top 100 (78 and 97 - they're not quite the same size as each other), which is a contrast to the previous P6 575 systems that were in the 200s at this point in their lives, and the previous NEC system that had completely fallen off the list when it was replaced!
If you look at what the Chief Executive of the Met Office said, they are initially looking at running more forecasts (one an hour rather than one every 6 hours), which will not mean that the granularity of the forecast will get any better, at least not straight away, just that it will be running more forecasts. That's going to be a logistical nightmare for the support people!
And as I understand it, the floor footprint and power profile of the Crays is going to be much greater, so much so that they are having to move one of the (three) new clusters into the Exeter Science Park.
Their forecasts for a specific location are very good. For an entire region it is impossible, unless by some freak chance the whole region has the same weather. An area by the sea could be quite windy, with fog a couple of miles inland. Giving a forecast in the style of the TV forecasts is pretty much nonsense really, and gives nothing more than a guide to what to expect.
Is looking out of the window. Actually, looking around you to see what's coming. With satellite imaging, you get a much bigger window.
Or ask a local sage - the older, the better. One of my neighbours (in his late 80s) is uncannily good at forecasting. Asked about recent "extreme weather" he shrugged and said that it's always been like this but people were "getting soft" these days.
Re: Dale 3 "2 hours ago"
But with a distributed network of such nodes across the whole country, you should be able to build up quite an accurate picture. You could call it OldGitTorrent.
Also aren't we living in a country with an aging population? Replacements shouldn't be a problem, but perhaps that's getting a bit dark.
Bollocks.
Looking out of the window may tell you what the weather will be like in the next hour. It won't tell you what it's going to be like tomorrow, let alone in five days time, and won't help at all if it is dark and overcast.
And looking at satellite images will tell you what is currently out-of-sight, but will not predict the track or change of the weather systems, nor tell you how they interact with each other. Weather systems are chaotic.
It's always been a fairly safe bet that if you were to predict that tomorrow's weather will be much like today's, you would probably be more right than wrong. But you would still be wrong sometimes. And all that your wise old sage confirmed was that the weather is notoriously difficult to predict, especially in the UK.
I suggest you ask your old duffer when to deploy the snow ploughs and gritters for your district council, and see how well they do at keeping the roads passable if he's better than the UKMO.
When looking at historical data, it's always possible to point out when the weather was at one extreme or another compared to the forecast. And people only remember when it was wrong, not when it was right. A proper statistical analysis shows that as time goes by, the Met Office are getting it right increasingly frequently. They will probably never be 100% correct all the time, but no forecaster will ever do that until they have a perfect model of the world.
Here's an exercise you might like to try. What does a 70% chance of rain on a particular day mean?
Go on. Tell me. Then go and check what the official definition is, and see whether you got it right. And once you understand, try to come up with a way of presenting the data that does not include a degree of uncertainty. Forecasts can never be certain.
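As a generic illustration of where such a percentage can come from (this is a common ensemble-based interpretation, not necessarily the Met Office's official definition - look that up as suggested above): run the model many times with perturbed starting conditions and report the fraction of members that produce rain at the point in question.

```python
# Hedged illustration: a "chance of rain" figure derived as the fraction of
# ensemble members producing rain at one location. Member values are made up.

ensemble_rain_mm = [0.0, 1.2, 0.0, 3.4, 0.8, 0.0, 2.1, 5.0, 0.3, 0.0]  # invented members
threshold_mm = 0.1

prob_of_rain = sum(m > threshold_mm for m in ensemble_rain_mm) / len(ensemble_rain_mm)
print(f"Chance of rain: {prob_of_rain:.0%}")   # 60% here - inherently uncertain
```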
And Google turned up this:
http://www.carbonbrief.org/blog/2014/02/how-accurate-are-the-met-office%E2%80%99s-predictions-a-closer-look-at-this-winter%E2%80%99s-forecast/
Please note that I do not work for the Met Office, but have links with it, which is why this is anon.
> It's always been a fairly safe bet that if you were to predict that tomorrow's weather will be much like today's, you would probably be more right than wrong. But you would still be wrong sometimes. And all that your wise old sage confirmed was that the weather is notoriously difficult to predict, especially in the UK.
And Mystic Met are what? Mostly right? Erm, no, they're really not. The Met can't correctly predict the weather a week away, never mind a month, a year, a decade, or a century. Until they give up on the global warming hoax and start looking for some proper scientists to write & use their software, all they're doing is adding to the flops they employ.
You see your weather, and it may not be completely the same as the forecast.
The Met Office compare their forecast for the whole country against the tens of thousands of weather station readings that they get.
I actually trust their analysis more than yours.
If you look at the overall movement of the weather systems compared to the forecast, it's uncannily accurate. The clouds go where they say they will go, and the amount of rain that falls is close to being right, and the temperatures are normally within 2 degrees.
But what they struggle with is exactly when the rain will fall, and what this means is that as a weather system moves over the country, it may fall either earlier or later than predicted. And this means that it will rain over a different part of the country! If that is as little as a kilometre and a half (one forecast cell), you may think that the forecast was wrong, even though it was broadly correct. And that is assuming that you are looking at your closest forecast location, rather than the regional forecast that the media have to give in the short amount of time allowed.
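A toy example of the displacement point above: a rain band that arrives one 1.5 km grid cell "late" is broadly correct over the area, yet looks flat wrong if you only check your own cell. The numbers are purely illustrative.

```python
# Toy illustration: a forecast rain band shifted by one grid cell relative to
# what was observed. Point-wise it's a miss; area-wise it's spot on.

forecast = [0, 0, 5, 5, 5, 0, 0, 0]   # mm of rain along a line of grid cells
observed = [0, 0, 0, 5, 5, 5, 0, 0]   # same band, shifted by one cell

my_cell = 2
print("Forecast at my cell:", forecast[my_cell], "mm; observed:", observed[my_cell], "mm")

# Point-wise it's a 'miss', but the area totals agree exactly:
print("Area totals - forecast:", sum(forecast), "mm, observed:", sum(observed), "mm")
```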
I don't think a new computer is going to help much, just read the tale of woe over at Tallbloke's talkshop.
Met Office did not see the storm of 21st October 2014 coming
Downvote I think. This link shows that they were able to start to spot the windy conditions 5 days out and then consistently warned of high winds. So a more powerful computer would allow them to see it coming further ahead and predict wind speed more accurately. Hence tallbloke's site is an argument in favour of a new box of tricks.
I upvoted, rather than downvoted. Because in the short term (up to an hour or so out of the window, maybe 3 hours using the rain radar and extrapolating, as I often do), you're dead right.
As proof, how many times has the (local) radio, relying on the forecast, told us it's bright and sunny outside, when any fule can look out the window and see it's peeing down. And vice versa.
But for any time horizon beyond that, forget it. And your sage. And whether the cows are standing or not. You really, really do need some good mathematical modelling.
> Asked about recent "extreme weather" he shrugged and said that it's always been like this but people were "getting soft" these days.
We are not getting soft, we are getting lied to. Climategate was never a tool used in journalspeak until recently. People were never worried about a few hundred parts per million of atmospheric gases. I really don't think they are now. Yes, there are always going to be evangelical and suicidal monomaniacs, but WTF cares about them?
I make it ~9 cabinets per (old routemaster) DD bus. According to the marketing blurb[1], it's up to (initially*) 75TF per cabinet. So, 7.5PF, or more if they're the new buses.
* are these like windows PCs, they get slower with age?
1. http://www.cray.com/Products/Computing/XC/Specs/Specifications-XC40-AC.aspx
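Working backwards through the poster's own arithmetic (9 cabinets per bus, up to 75 TF per cabinet per the linked spec sheet, and their 7.5 PF conclusion) gives a sense of what bus count is being assumed; nothing below comes from the article itself.

```python
# Sanity-check of the back-of-the-envelope above, using only the poster's figures.

cabinets_per_bus = 9
tf_per_cabinet = 75

implied_cabinets = 7.5e3 / tf_per_cabinet          # 7.5 PF expressed in TF
implied_buses = implied_cabinets / cabinets_per_bus

print(f"7.5 PF implies ~{implied_cabinets:.0f} cabinets, i.e. ~{implied_buses:.0f} buses' worth")
# So the estimate assumes a machine roughly eleven Routemasters' worth of cabinets.
```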
Having recently read about IBM's holey chip, which is an optical data conduit intended to improve communication between processors, I am at a loss as to why the MET has fallen for Cray, which to my knowledge hasn't taken advantage of the benefits optical chips offer.
Optical computing is the watershed moment in computing, much like silicon was. Speeds and performance will blow traditional silicon out of the water.
Several companies are blazing a trail into this tech with the ultimate goal of optical processors stacking effortlessly within a wholly optical system. When that day comes the MET's 16 PFLOP machine will be sitting atop the mechanical adding machines in a council dump.
Just because it is tech on the horizon shouldn't preclude the take-up of the many optical devices that have already proved their usefulness in ramping up data chatter between cores and suchlike, and companies that are not including such advances in their offerings should be viewed with a very critical eye.
On the horizon, as far as technology is concerned, means that it takes at least 7-10 years for something that is being demonstrated now to make it into a commercial product. And I think that you will not get optical interconnects between (optical) chips on a board until you can lay down an optical path on a multi-layer board. If you have to have optical cables connecting everything, the computing density will go down significantly, and be a major maintenance problem.
As an aside, the P7 775 systems will have been in use for forecasting for about 3 years at the Met Office when they get turned off next year. At that rate, they will have gone through at least one further generation of HPC beyond what is being announced today before there is any chance of optical chips coming along.
As a second aside, the interconnect of the existing P7 775 systems is already mostly optical, but there are electrical to optical transceivers attached directly to the Torrent hub, with fibre running to the backplane. Inside each drawer, it's mostly electrical. From drawer to drawer is optical. The back of the Cray systems appear to have lots of orange cables (the colour normally associated with optics) as well.
It really depends on the workload. Running ensembles doesn't require tight coupling between different parts of the ensemble. If an hourly forecast takes more than an hour to produce, then you will be pipelining them, but each run will not be coupled tightly with the preceding or following run, so you can use the two sites to increase your overall job throughput.
There would still be dependencies on shared filesystems, but propagation delay between campuses would not be nearly as big a problem there as for message passing within communication-heavy jobs.
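A minimal sketch of the "loosely coupled ensembles" point: members are independent jobs, so they can be farmed out across two campuses without tight coupling between them. The site names and round-robin placement below are invented purely for illustration.

```python
# Sketch: dispatch independent ensemble members across two loosely coupled sites.
# No inter-member communication is needed, so campus-to-campus latency barely matters.

from concurrent.futures import ThreadPoolExecutor

SITES = ["exeter-hq", "exeter-science-park"]   # hypothetical scheduler targets

def run_member(member_id: int) -> str:
    site = SITES[member_id % len(SITES)]       # trivial round-robin placement
    # ... here you would submit the member's model run to that site's batch scheduler ...
    return f"member {member_id:02d} -> {site}"

with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(run_member, range(12)):
        print(result)
```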
The £97M installation will ape the layout of the current systems, with a couple of differences, and will consist of four clusters: two for operational forecast work, probably on the main Exeter site; one smaller test and development cluster for trialling new code and operating system/development system updates that may prove disruptive; and a fourth cluster, similar in size to or possibly larger than the operational forecast machines, for work involving users from outside of the Met Office (this is partly why the fund allocation is so large this time round), which is probably going to be on the Exeter Science Park.
The reason they operate two separate clusters for the operational forecast is so that they have a 'warm' standby that is kept in sync with the data of the forecast runs, so that if a catastrophic system failure happens, they can switch the forecast to the other cluster with minimal disruption. This also allows one cluster at a time to be taken down for maintenance while maintaining the relentless sequence of forecasts the Met Office's customers expect. This is a very common layout for supercomputer installations involved in time-sensitive production or commercialised services such as weather forecasting.
This 'warm standby' system will not sit unused. While it is not running the operational forecast, it will be running climate or weather model research work (which also fills the spare capacity of the 'live' cluster). It will be used to run new iterations of the operational forecast in parallel to check for consistency with the current forecast, and also to check that there are no scheduling or sequencing errors in the new iteration before it goes live.
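A hedged sketch of that warm-standby arrangement: two clusters kept in sync, one running the operational forecast, the other running pre-emptible research work until it is promoted. The class names and role strings below are invented for illustration, not anything the Met Office actually runs.

```python
# Sketch of a warm-standby role assignment, under the layout described above.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.role = "research"

def assign_roles(primary, standby):
    if primary.healthy:
        primary.role, standby.role = "operational forecast", "research / parallel trials"
    else:
        # Catastrophic failure: pre-empt research work and promote the standby.
        standby.role, primary.role = "operational forecast", "offline / maintenance"

a, b = Cluster("cluster-a"), Cluster("cluster-b")
assign_roles(a, b)
print(a.name, "->", a.role, "|", b.name, "->", b.role)

a.healthy = False
assign_roles(a, b)
print(a.name, "->", a.role, "|", b.name, "->", b.role)
```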
This layout is not set in stone. It is possible that one of the operational forecast machines may be swapped with the collaboration system on the Science Park for DR geographical isolation reasons, but that has a whole host of synchronisation problems keeping the data in sync between the systems (there is quite a lot of data involved).
As Rob Varley (CEO Met Office) has said, they will be using much of the new capacity to move from full forecasts every 6 hours to hourly forecasts. This will mean that the runs will overlap (they take more than an hour to run at the moment, and probably will continue to do so because the new system scales wide not fast compared to the current IBM system). At some point in the future, they may also reduce the cell size for both the UK and the World forecasts, but this relies on them resolving some of the issues with parallelisation that are inherent in the way that the Unified Model is written.
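Some quick arithmetic on that overlap: if a full run takes longer than an hour but a new run starts every hour, several runs are always in flight at once. The 100-minute runtime below is an assumed figure for illustration only; the thread only says runs currently take "more than an hour".

```python
# How many forecast runs are in flight at once under an hourly cadence.

import math

run_minutes = 100        # assumed wall-clock time per forecast run (illustrative)
cadence_minutes = 60     # a new run starts every hour

concurrent_runs = math.ceil(run_minutes / cadence_minutes)
print(f"Runs in flight at any moment: {concurrent_runs}")
# Compare with the old regime: one run every 6 hours never overlaps itself.
```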
Do you know how many weather stations the Met Office use? I'm certain that it's more than you think.
There's over 200 full observation points in the UK providing at least hourly data (progressing to every minute in the near future), and many hundreds more that give partial or less frequent data. And they collaborate with other weather agencies to get access to thousands more observations across the world, and invest in technologies like EUMETSAT for satellite imaging.
One of the interesting things is that in the event of a power constraint at the Exeter site, the HPCs will be shut down before the mainframe and observation-gathering network, because a continuous set of observations is more valuable than missing a forecast!
"Do you know how many weather stations the Met Office use? I'm certain that it's more than you think."
I don't think so, but my point was that surely more of those - or better ones - would give us better results than a machine that crunches the same data a bit faster. All right, a lot faster, but the old machine wasn't running at full capacity so is there any point in that?
I am please as they have the Dragonfly network topology and I would agree with the Aries interconnects instead of the slow ones.
The DataWarp is helping with the application I/O acceleration, but I fear that they have gone for the Linux which will not help this. As we have seen only the last month that the Linux is not secure, and not well designed for a fast operation due to it being what we call "the design by committee" of the open source student coder.
It confuses why they spend so much money on the hardware but not willing to pay for a more professional operating system like with Windows.
Are you imagining that the computers they are buying will be supplied with Linux Operating systems designed by the distributors?
I didn't even know Cray was a Linux distributor.
I had formed some woolly, sheltered, half formed idea that the Met Office itself would be writing the operating system. I can't imagine Base or any other tools written by dedicated (even the most dedicated) amateurs being of infinite use and compatibility with ground to cloud. Someone mentioned that "weather systems are chaotic" I was going to say that the phrase doesn't mean they are difficult to analyse. But I think I had better stop now.
All recent Crays run the Cray Linux Environment (CLE), a SUSE-based (SLES) distribution with extra tools, added drivers for proprietary hardware such as their interconnect and distributed memory model, and a highly tuned kernel.
The hardware provides high bandwidth and low latency, which makes it much simpler to scale out and share massive amounts of memory in a single system image.
I could not find info on whether the XC40 scales further than the SGI UV, with its 64TB and 2048 cores in a single system image.
You can't get those features on a commodity box with off-the-shelf interconnects.
Cray XCs don't share address space across large numbers of sockets like SGI UV does. The XC40 has two-socket Xeon compute nodes, and each node runs its own OS instance.
Limits to effective scalability of commodity x86 servers with off the shelf IB vary, depending on the workloads you want to support, as well as which vendor you care to ask. =:^)
Remind me. How many Windows systems are there on the Top 500 Supercomputer list?
I assume you are either joking or a troll. I cannot really think you are really serious.
I don't think Cray supply anything other than Linux on their hardware.
Nothing frightens me more than having Windows on US Navy warships. And that's as a career US Navy veteran, served wartime, and able to design, build, and code an entire machine from scratch. God knows I did it all too often while in the Navy. Eeek!!!
Look in the top500 - there are two Windows-based systems in the list; one at the bottom, one below half way.
Windows is utterly unsuitable for this kind of system:
1) Poor-to-no support for high-performance interconnects
2) What's the gui for on a compute node? Sucking resources, that's what.
3) No possibility of vendor optimisations
4) Monolithic kernel
5) Ridiculously high licensing costs
6) Virtually no ability to monitor the internals to find and remove bottlenecks
Contrary to your assertion about 'open source student coders', much of Linux and its ecosystem are the work of highly-paid, highly-motivated and highly-trained developers working at IBM, HP, Cray, Intel, AMD, SGI, Samsung, Google, US and other national labs etc. Oh, and Microsoft too. The difference is that other highly-paid, highly-motivated and highly-trained coders from other organisations can see and critique the changes.
The Cray distribution (and kernel) will be pared down, tuned to their hardware and optimised for the specific application(s) at hand. Bespoke tweaks will be added during the installation phase, to ensure the acceptance criteria are met. During the lifetime of the cluster, there will be dedicated Cray engineers onsite constantly passing information and fixes back and forth to their developers, rather than passing data to Microsoft to (hopefully) get an update at some point in the future.
There are many valid criticisms of Linux (and the open source development model); you have produced none of them.
The machine is quoted as a "Cray XC40", which between the Beeb, Met, and Cray news articles will have the following stats:
* 16 petaflops.
* 480,000 cores ("CPU" in the article).
* 140 tonnes.
* 20+ petabytes of storage at 1.5 terabytes/second bandwidth.
It represents the largest Cray supercomputer contract outside of the USA (note "contract").
Of course, ranking systems like this by a few easy attributes doesn't tell you how useful it will be in the applicable field; presumably the Met have made this determination (CFD-dominated algorithms?).
I've left out any statistics in irrational units (Poodles per Olympic Swimming Pool etc).
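One rational-unit check that does fall out of those stats: dividing the headline peak flops by the core count gives a per-core figure you can compare against what a current Xeon core manages, just to see whether the numbers hang together.

```python
# Plausibility check on the headline figures above (peak flops / core count).

peak_flops = 16e15       # 16 PF
cores = 480_000

per_core_gflops = peak_flops / cores / 1e9
print(f"~{per_core_gflops:.0f} GFLOPS per core (peak)")
# ~33 GFLOPS/core is in the right ballpark for a modern Xeon core with wide
# vector units at a few GHz, so the quoted figures are at least self-consistent.
```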
The calculation of the position in the Top500 in 2017 seems to have been done by looking at the Top500 performance development charts ( http://www.top500.org/statistics/perfdevel/ ). In Feb 2017, the number 1 slot is projected to be 236 petaflops and the number 500 slot 1.4PF.
Putting 16PF in at number 250 is a bit strange, as the distribution of the Rmax in any Top500 list is very non-linear, with a small number of very powerful machines (which are used for applications that happen to score highly on the linpack metric) and a long tail of lower-scoring machines (that are either smaller, or built for applications which don't score well on the linpack metric). Using the distribution of the most recent list (http://www.top500.org/statistics/efficiency-power-cores/ and change the chart type to Rmax), this new machine has an Rmax of 16PF, which is roughly 10x the projected bottom of the 2017 list, and which I would guess comes in at approximately number 25. Using older lists, this exact number changes a bit, but it's fairly robust between number 10 and number 30.
I wouldn't call that "VERY AVERAGE".
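A rough way to reproduce that estimate: assume Rmax falls off as a power law in rank (a few very big machines, then a long shallow tail), fit it through the projected Feb 2017 endpoints quoted above, and ask where 16 PF would land. The power-law form is my assumption; the poster used the real list's shape.

```python
# Where would 16 PF land in a power-law tail fitted to the projected 2017 list?

import math

top, bottom, entries = 236.0, 1.4, 500     # PF figures from the Top500 projection above
target = 16.0

alpha = math.log(top / bottom) / math.log(entries)   # slope of Rmax(r) = top * r**(-alpha)
rank = (top / target) ** (1 / alpha)
print(f"Estimated rank for {target} PF: ~{rank:.0f}")   # lands in the mid-20s
```

That comes out around rank 26, consistent with the 10-30 range argued for above, rather than the naive 250.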
Top 500 is out today: http://www.top500.org/list/2016/06/
The phase 1 systems (so not the bulk of the compute) come in at 29 and 30. That beats all of the other meteorology HPCs, except ECMWF.
For a forecast position of 25 (range 10 to 30), coming in at 29 & 30 is not a bad forecast! ;-)
All power consumed by computers, from the 25W human brain to a 6.3MW Cray XC40, is dissipated. This is in contrast to chemical processes like Haber-Bosch ammonia synthesis (good for 5% of global energy use), where a large part of the energy ends up in the product.
The installation of the HPC at the Exeter Science Park will influence the local climate which it is supposed to forecast.
Will this be the beginning of self-conscious computers?