Re: Bigger is better
It was a bit of a ramble-rant. I'll try and slice it differently... :)
1) Our HPC-type workloads are the opposite: they vastly exceed the capacity of a single host.
+ Thus the blast radius of a host spans only a fraction of the workload.
+ Checkpointing and rescheduling the affected portion of the workload to another host is how we address these failures (quick & efficient - see the sketch after this list).
2) Typically, VMs (and zillion-core hosts) are used to aggregate together lots of workloads that are each *much* smaller than the capacity of the host.
+ The blast radius of a host spans a lot of workloads.
+ VM migration can be used to mitigate this problem if you have some spare capacity (slow, and it burns host compute resources in the background).
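For what it's worth, the checkpoint & reschedule pattern in 1) looks roughly like the toy sketch below (Python, purely illustrative - the shard/host/checkpoint names are hypothetical stand-ins, not our actual scheduler):

    import random

    checkpoints = {}  # shard -> last completed step (the "checkpoint" store)

    class HostFailure(Exception):
        """Stand-in for a bare-metal host dying mid-run."""

    def run_shard(shard, host, start_step, total_steps=10):
        """Toy shard: works step by step, checkpointing as it goes,
        and randomly simulates the host exploding."""
        step = start_step
        while step < total_steps:
            if random.random() < 0.05:
                raise HostFailure(f"{host} died running {shard} at step {step}")
            step += 1
            checkpoints[shard] = step  # save checkpoint: remember how far we got

    def run_with_reschedule(shards, hosts):
        """On a host failure, only the affected shard is redone - from its last
        checkpoint, on another healthy host. Everything else keeps its progress."""
        hosts = list(hosts)
        for shard in shards:
            host = random.choice(hosts)
            while True:
                try:
                    run_shard(shard, host, checkpoints.get(shard, 0))
                    break  # shard finished
                except HostFailure:
                    if len(hosts) > 1:
                        hosts.remove(host)  # blast radius: this one host
                    host = random.choice(hosts)  # redo only the affected shard elsewhere

    run_with_reschedule([f"shard{i}" for i in range(8)], [f"host{i}" for i in range(4)])

The point of the sketch: the failure handling is per-shard and stateless from the scheduler's point of view, which is why it stays quick and cheap compared to migrating a whole VM's working set.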
The blast radius of a zillion-core host is not a problem for HPC, because the workload vastly exceeds the capacity of that one host and will inevitably have been engineered to tolerate a bare-metal host failure (this *will* be battle-proven, because the MTBF of a set of hosts is *MUCH* lower than the MTBF of a single host).
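(Back-of-envelope version of that MTBF point, assuming independent hosts with exponentially distributed failures - the 1000-host / 5-year figures are just illustrative:)

    \[
      \mathrm{MTBF}_{N\,\text{hosts}} \approx \frac{\mathrm{MTBF}_{\text{1 host}}}{N}
      \qquad\Rightarrow\qquad
      \frac{5\ \text{years}}{1000} \approx 1.8\ \text{days between failures somewhere in the fleet}
    \]

So at any reasonable fleet size you *will* be exercising your host-failure path constantly, whether you designed for it or not.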
Meanwhile, in the land where you use Hypervisors to aggregate multiple workloads together and rely on VM migration for resilience to hosts exploding, the Hypervisors can actually multiply the blast radius, because the hosts are necessarily tightly coupled in order to do the complex task of migrating a few thousand processes & TBs of working set... Vendors/operators are reluctant to (or simply won't) support migrating workloads between hosts at different Hypervisor version/patch levels. So if you need to patch the Hypervisor layer, you have to patch hosts in batches - multiplying your effective blast radius while that patching happens (in practice it happens at rack level, so the blast radius is 16x bare metal whenever VM migration breaks or the Hypervisors need an upgrade).
... rant bit ...
Suffice to say, in practice, the real blast-radius problem wasn't a problem for our HPC workloads on bare metal (the hosts are loosely coupled and migration is cheap) - but it *has* become a problem for our HPC workloads that run under VMs, because the Hypervisor layer has multiplied the blast radius from taking down one host to taking down a rack at a time (the hosts are now tightly coupled at the VM layer). We've seen this manifest as outages when running our workloads under VMs - something the bare-metal-hosted workloads haven't experienced since a DC burnt down several years ago.
I suspect the main driver for our org forcing us to run stuff under private Cloud is to increase their utilization figures, as our HPC workloads achieve > 90% utilization 365x24 on bare metal. The Cloud utilization figures went from ~17% to ~32% when a small portion of our HPC workloads were migrated to Cloud. ;)