Re: hot swapping - old news
Tandem is old news.
Unfortunately, the RAS features that used to be around in Tandem and Stratus (bloody hell, Stratus still exists!) are apparently features that vendors no longer consider useful.
At the same time, customers have been encouraged, for power and (supposedly) manageability reasons, to consolidate all their systems onto ever-larger single systems divided up by virtualisation.
And what happens when components fail and need swapping? Well, I/O cards can be swapped, as can drives, power supply components and fans. But once you get to core components like CPUs and memory, the only way is to take part or all of a system out of service.
Even in the modular IBM Power Systems (770, 780, 870 and 880) systems, where supposedly you can power down individual drawers, I've never come across a CPU or memory repair action that suggested powering down just the affected processor drawer; the procedure always wants the whole system powered down.
The solution to this? Well, on-the-fly workload migration is normally the current suggestion, but that means you have to keep spare capacity equal to your largest system, and there will be performance and time constraints while migrations are carried out. Otherwise, you de-construct your workloads and place them onto smaller systems that you can afford to take down for service actions without affecting the service.
Of course, hardware will now continue to run in a degraded state (if a CPU core or memory DIMM fails, the rest of the system may well continue to run), meaning that you can plan your outages rather better than you used to be able to, but to restore full performance, some outage will probably be required.
If Huawei can produce servers at a reasonable cost where CPUs and memory can be replaced without shutting a system down, I can see current buyers of Power and SPARC systems looking at them very carefully, but it will need some OS modifications to allow hardware to be disabled and not considered for work. It's possible, but will need work in the scheduler and the memory allocation code. Power with IBM i and AIX can do some of this already, but I'm not sure that Linux on Power can, and I think on Intel, it's still in its infancy.
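For what it's worth, mainline Linux does already expose the software half of this through sysfs CPU and memory hotplug, so the scheduler side is further along than the firmware side. A rough sketch (the core number 3 and memory block 42 below are just examples, and the offline/online writes need root on a real machine):

```shell
# List which logical CPUs the scheduler can currently dispatch work to.
cat /sys/devices/system/cpu/online 2>/dev/null || echo "sysfs not available"

# Take a (hypothetical) failing core out of the scheduler's view,
# then bring it back once the part has been replaced:
#   echo 0 > /sys/devices/system/cpu/cpu3/online
#   echo 1 > /sys/devices/system/cpu/cpu3/online

# Memory has a similar, coarser interface, per memory block:
#   echo offline > /sys/devices/system/memory/memory42/state
```

Disabling the core in software is the easy bit, of course; whether you can then physically pull the part without firmware and electrical support is the real question.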
But with the integration of memory and PCIe controllers in modern processor dies, system builders will have to know a whole lot more about the internal architecture of the systems to provide resilient configurations that will allow processor cores with all their associated on-die controllers to be removed without affecting the service.
I personally still favour a larger number of smaller systems, rather than relying on increased complexity in the design, and I think that, whether knowingly or not, customers embracing cloud are making the same decisions.