Full Circle, and control of tail latency
Amusing to think that when I started in computing 40ish years ago, disk drives were addressed by cylinder-track-sector and the operating system was responsible for bad blocks, address mapping, and the like. We, umm, kinda got away from that: when servers went from 2 or 3 disks to 200 or 300 disks in the 1990s, the higher-level abstraction SCSI provided was a relief, and, oh by the way, that OS code had become a single point of failure (and never handled multiple concurrent disk failures well anyhow).
Of course, abstracting a lot of that detail into the disk drive got us a tail latency problem. My favorite was thermal recalibration: the app was running just fine, and then out of the blue one of the drives decided to take the better part of a second for internal housekeeping (and the app can wait, thank you), in an era when server OSes tended to blue screen if a disk they depended on went away for mere seconds. SSDs just magnify the tail latency, in an era when applications are far less able to tolerate it. Hence a server with a handful of embedded SSDs wants to onload anything that causes tail latency back to where it can be understood and controlled.
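To put a number on "tail latency," here is a minimal C sketch, assuming a Linux box; the device path, I/O count, and window size are all illustrative. It times a run of direct 4 KiB reads and prints the median next to the p99 and p99.9: the median can look perfectly healthy while the p99.9 betrays the occasional housekeeping stall.

```c
/* Minimal sketch: time N_IOS direct 4 KiB reads and report median vs.
 * tail latency. The default path is an assumption; pass a file or
 * device that supports O_DIRECT on your system. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define N_IOS   10000
#define IO_SIZE 4096

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/nvme0n1"; /* assumption */
    int fd = open(path, O_RDONLY | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, IO_SIZE))    /* O_DIRECT needs alignment */
        return 1;

    static double lat_us[N_IOS];
    for (int i = 0; i < N_IOS; i++) {
        off_t off = ((off_t)rand() % 1024) * IO_SIZE; /* small random window */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pread(fd, buf, IO_SIZE, off) != IO_SIZE) { perror("pread"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        lat_us[i] = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }

    qsort(lat_us, N_IOS, sizeof(double), cmp_double);
    printf("p50   %8.1f us\n", lat_us[N_IOS / 2]);
    printf("p99   %8.1f us\n", lat_us[(int)(N_IOS * 0.99)]);
    printf("p99.9 %8.1f us\n", lat_us[(int)(N_IOS * 0.999)]);
    return 0;
}
```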
There be dragons here: I spent much of the last couple of years of my career helping think through how Gen-Z (www.genzconsortium.org) would be used by the combination of servers, storage, and networking in the data center. The best and highest use of byte-addressable storage class memory, once the write endurance of parts is 10^15ish rather than 10^9ish, is to let applications read and write persistent memory directly (through hardware address mapping and protection, in the style server memory has been protected for the last 30 years) rather than through a storage stack. (1) Requiring a kernel crossing to read or write persistent memory risks making the OS king of the hill while wear leveling is still required, and ossifying an obstacle to a massive performance gain once write endurance improves; and (2) I remember all the code in OSes' "SCSI Services" layers and the like running elevator algorithms to reorder I/Os, optimizing IO/sec by limiting disk seeks... still consuming CPU cycles in the 2000s, when disk arrays had had very good caches for decades. What a waste of path length and CPU cycles (which of course was finally optimized out for NVMe).
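To make the direct-access model concrete, here's a minimal sketch assuming a file on a DAX-capable filesystem at a hypothetical /mnt/pmem mount: the data path is just a store through a mapping plus an explicit flush, with no read()/write() kernel crossing. I use portable msync() here; on real persistent memory you'd typically flush cache lines instead (CLWB plus a fence, e.g. via a library such as PMDK's libpmem).

```c
/* Minimal sketch of the "no storage stack" model: map a file into the
 * address space, persist data with ordinary stores plus an explicit
 * flush. Path and sizes are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/pmem/log";          /* assumed DAX mount */
    size_t len = 4096;

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }

    char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping survives the close */

    /* The "storage" update is just a store through the mapping... */
    strcpy(pmem, "record 42: hello, persistent world");

    /* ...made durable with an explicit flush of the affected range. */
    if (msync(pmem, len, MS_SYNC) < 0) { perror("msync"); return 1; }

    munmap(pmem, len);
    return 0;
}
```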
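And for flavor on point (2), a toy version of the elevator (SCAN) reordering those layers did, with made-up cylinder numbers: sort the pending queue, sweep up from the current head position, then sweep back down.

```c
/* Toy elevator (SCAN) sketch: service pending requests in ascending
 * order above the head position, then descending below it. All numbers
 * are made up; the point is only the reordering itself. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_ul(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a, y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    unsigned long head = 5000;                   /* current cylinder */
    unsigned long req[] = { 9000, 1200, 5300, 300, 7600, 4100 };
    size_t n = sizeof(req) / sizeof(req[0]);

    qsort(req, n, sizeof(req[0]), cmp_ul);

    /* find the first request at or above the head position */
    size_t split = 0;
    while (split < n && req[split] < head) split++;

    printf("service order:");
    for (size_t i = split; i < n; i++) printf(" %lu", req[i]); /* sweep up */
    for (size_t i = split; i-- > 0; )  printf(" %lu", req[i]); /* sweep down */
    printf("\n");
    return 0;
}
```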