Arithmetic, anyone?
Interesting piece, but I worry about any table of results that does not appear to be internally consistent.
The piece does not define what exactly is meant by "latency", but it is odd that the results in the second and third columns (both labelled "latency") are precisely 1/2 of the transfer time that one would compute using the complicated mathematical formula "time = quantity / rate". In this case, for example, one would expect transferring 16 KiB at a rate of 2000 MB/s to take 8.192 microseconds, rather than the 4.09 microseconds stated in the table.
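For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes 1 KiB = 1024 bytes and 1 MB = 10^6 bytes (the only reading that gives 8.192 microseconds); the function and variable names are mine, not the article's.

```python
def transfer_time_us(size_bytes, rate_bytes_per_s):
    """Time to move size_bytes at rate_bytes_per_s, in microseconds."""
    return size_bytes / rate_bytes_per_s * 1e6

# Assumed unit conventions: KiB is binary, MB is decimal.
block_bytes = 16 * 1024          # 16 KiB
rate_bytes_per_s = 2000 * 10**6  # 2000 MB/s

expected_us = transfer_time_us(block_bytes, rate_bytes_per_s)
half_us = expected_us / 2

print(f"time = quantity / rate: {expected_us:.3f} us")
print(f"half of that:           {half_us:.3f} us")
```

Half of the computed transfer time is 4.096 microseconds, which lines up with the 4.09 in the table.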
Latency might mean "time for the first bit of the data to arrive" or it might mean "time for the entire block of data to arrive". The latter is not possible given the stated numbers (since all the computed transfer times are precisely twice the stated "latencies"), while the former would imply a mildly perverse buffering scheme that always buffered precisely 1/2 of the data before beginning to deliver it to the accelerator.
Buffering exactly 1/2 of the data is perhaps not as crazy as it sounds -- such schemes are sometimes (often?) used in optimized rate-matching interfaces. If the input is guaranteed to be a contiguous block, then buffering exactly 1/2 the data allows the buffer to transmit the output data at 2x the input rate after pausing for 1/2 of the transfer time. Such a scheme minimizes the latency between the arrival and delivery of the final bit of data in the input block. Unfortunately, it also makes the actual hardware latency invisible (provided that the hardware latency is less than 1/2 of the transfer time of the smallest block with reported results).
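To make the half-buffering idea concrete, here is a rough sketch of the timing, again in Python. It is a toy model of the scheme described above, not anything taken from the article: it assumes a contiguous block arriving at a constant rate, zero hardware latency, and an output side that runs at exactly 2x the input rate. All names are hypothetical.

```python
def arrived(t, size, rate_in):
    """Bytes that have entered the buffer by time t (constant input rate)."""
    return min(size, rate_in * t)

def delivered(t, size, rate_in):
    """Bytes sent out by time t: wait for half the block, then send at 2x."""
    start = (size / rate_in) / 2          # pause for 1/2 of the transfer time
    if t < start:
        return 0.0
    return min(size, 2 * rate_in * (t - start))

size = 16 * 1024      # 16 KiB block (assumed binary KiB)
rate_in = 2000e6      # 2000 MB/s input (assumed decimal MB)
t_total = size / rate_in

# The output must never get ahead of the input (no buffer underflow).
samples = [t_total * k / 100 for k in range(101)]
no_underflow = all(
    delivered(t, size, rate_in) <= arrived(t, size, rate_in) + 1e-6
    for t in samples
)

first_out_delay_us = (t_total / 2) * 1e6
output_done = t_total / 2 + size / (2 * rate_in)
last_bit_lag_us = (output_done - t_total) * 1e6

print(f"no underflow:            {no_underflow}")
print(f"delay to first output:   {first_out_delay_us:.3f} us")
print(f"lag of final bit:        {last_bit_lag_us:.3f} us")
```

In this model the first output appears after half the transfer time, while the final bit leaves the buffer at the moment it arrives, so a "latency" measured to the first output would be 1/2 of the transfer time regardless of what the hardware itself adds.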
Whether this buffering scheme makes sense depends a lot on the data access patterns of the subsequent processing steps. If the subsequent step demands that a full block be in place before starting, then this is the way to go. On the other hand, many signal processing algorithms could pipeline operations with data transfers in smaller blocks, in which case a different buffering scheme might make more sense.