Goes back to Bell System ESS1
Where "Auditor" programs swept the exchanges memory looking for corrupted data structures (which was how it kept track of the calls and services) and returned them to the free storage pool, possibly terminating a call in the process.
It appears Cloudfare have generalised this to their whole data centers. LSI used to supply large drive arrays with fail-soft capabilities at the turn of the century.
Honestly I don't get why, in the 3rd decade of the 21st century, most (all?) major infrastructure suppliers haven't automated their processes.
All routine tasks should be at least semi-automated by now using one or other available tools.
Experienced, skilled staff are the major asset of such businesses.
Mines the one with a copy of Glenford Myers "Software Reliability" in the (oversize) side pocket.