RoCE latency is under 2 microseconds.
I was looking into this for a ZFS ZIL.
From what I understand, a typical use case would be direct-attached Ethernet, i.e. host to host instead of through a switch, so that there are no bottlenecks.
The Ethernet frame does contain an IP packet carrying a UDP datagram (this is RoCE v2; RoCE v1 runs directly on Ethernet without the IP and UDP layers).
The protocol keeps a packet sequence number, as TCP does, but otherwise runs over UDP. The connection is assumed to be reliable most of the time, so instead of acknowledging each packet, the receiver only sends a response if a data packet is missed.
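The "respond only on a miss" idea can be sketched as a toy receiver. This is only an illustration of the scheme as described above, not the actual RoCE wire protocol; the class name `NakReceiver`, the `("NAK", seq)` tuple, and the packet payloads are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NakReceiver:
    """Toy receiver: silent on in-order packets, replies only when a gap is seen."""
    expected_seq: int = 0
    delivered: list = field(default_factory=list)

    def on_packet(self, seq: int, payload: bytes):
        """Return None for an accepted or duplicate packet, else a NAK tuple."""
        if seq == self.expected_seq:
            # In-order packet: deliver it and stay quiet (no per-packet ack).
            self.delivered.append(payload)
            self.expected_seq += 1
            return None
        if seq < self.expected_seq:
            # Duplicate of something already delivered: ignore it.
            return None
        # Sequence number jumped ahead, so a packet was missed.
        # Ask the sender to resume from the first missing sequence number.
        return ("NAK", self.expected_seq)


def demo():
    rx = NakReceiver()
    # Packet 2 is "lost" at first, so packet 3 arrives early and triggers a NAK.
    replies = [rx.on_packet(seq, b"data%d" % seq) for seq in (0, 1, 3, 2, 3)]
    return replies, rx.expected_seq
```

In this sketch the only traffic flowing back to the sender is the single NAK for sequence number 2; every in-order packet costs nothing on the return path.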
For a bunch of file servers you would have a top-of-rack switch. You may also have a pair of top-of-rack RoCE boxes, with all of the other storage servers in the rack sending them their write transaction log. In case of a server outage, the failover server on the SAN would read from the transaction log and continue where the first one left off.
In the case of paired servers with a primary and a failover, a server could mirror an internal memory space for transparent failover.
Client requests foo.txt for write via SMB.
Samba marks in memory that the file will be in use for write.
Queue the lock to NVMe (the fastest confirmation from NVMe is 150 microseconds).
Send the lock event to the paired RoCE servers.
Wait 2 microseconds.
Inform the client of the lock.
148 microseconds later, the NVMe write confirms.
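The timing in the steps above works out as follows. This is just the arithmetic from the figures quoted in this post (2 microseconds for RoCE, 150 microseconds for NVMe); the function and key names are invented for the illustration.

```python
NVME_CONFIRM_US = 150  # fastest NVMe confirmation quoted above
ROCE_CONFIRM_US = 2    # RoCE latency quoted above


def lock_timeline(nvme_us: int = NVME_CONFIRM_US, roce_us: int = ROCE_CONFIRM_US) -> dict:
    """Times are microseconds after the lock is queued to NVMe and sent over RoCE."""
    return {
        # The client is told about the lock as soon as the RoCE peers have it.
        "client_informed_at_us": roce_us,
        # The NVMe write confirms later, in the background.
        "nvme_confirmed_at_us": nvme_us,
        # How long after the client was answered the NVMe confirmation arrives.
        "nvme_lag_after_client_us": nvme_us - roce_us,
    }
```

With the default figures, the client is answered at 2 microseconds and the NVMe confirmation trails it by 148 microseconds.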
Historically, about 2/3 of file server IO events are lock operations.
Handling those locks over RoCE instead of on disk can effectively triple your IOPS.
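The "triple" figure follows from simple arithmetic, under the assumption (mine, for the sake of the example) that a lock event moved to RoCE costs no disk IO at all. The function name is made up.

```python
from fractions import Fraction


def effective_iops_multiplier(lock_fraction: Fraction) -> Fraction:
    """If lock_fraction of IO events no longer hit the disk, the same disk
    budget covers 1 / (1 - lock_fraction) times as many client requests."""
    if not 0 <= lock_fraction < 1:
        raise ValueError("lock_fraction must be in [0, 1)")
    return 1 / (1 - lock_fraction)


# With 2/3 of events being locks, only 1/3 still reach the disk.
multiplier = effective_iops_multiplier(Fraction(2, 3))
```

Here `multiplier` comes out as exactly 3: the disk only sees one event in three, so it can serve three times as many requests before saturating.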
If you combine this with a file system that uses write combining (ZFS), then for a single-threaded task this can theoretically speed things up by a factor of 70.
-Michael McMillan
mikegrok on linkedin