Re: FP4??!!
Normally you use FP4 to compress the weights: each weight is multiplied by an element of an FP16 vector, and the products are accumulated into an FP32.
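For anyone who wants to poke at the arithmetic, here is a rough sketch of that pattern in plain Python. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1), which is one common FP4 encoding and not necessarily the one this chip uses. The nibble packing order and all function names are made up for illustration, and the FP16/FP32 rounding is emulated with the struct module rather than real hardware types.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def decode_e2m1(code):
    """Decode a 4-bit code: sign(1) exponent(2) mantissa(1), bias 1 (assumed layout)."""
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        mag = 0.5 * man                       # subnormal: 0 or 0.5
    else:
        mag = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

def unpack_nibbles(packed):
    """Two codes per byte, low nibble first (an arbitrary packing choice)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes

def dot_fp4_fp16(packed_weights, activations):
    """FP4 weights times FP16 activations, accumulated with FP32 rounding."""
    acc = 0.0
    for code, x in zip(unpack_nibbles(packed_weights), activations):
        w = to_fp16(decode_e2m1(code))        # weight widened to FP16 (exact)
        acc = to_fp32(acc + w * to_fp16(x))   # running sum rounded to FP32
    return acc

# Weights 1.0, 6.0, -2.0, -0.5 packed into two bytes.
result = dot_fp4_fp16(bytes([0x72, 0x9C]), [1.0, 0.5, 2.0, 4.0])
```

The point of the split precisions is that the weights take a quarter of the memory of FP16, while the running sum keeps enough bits that long dot products don't drown in rounding error.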
I'm slightly surprised that they use a fixed FP4 format. Using the four bits as an index into a float16[16] table would be a lot more flexible, but maybe the silicon cost of that many multiplexers is actually perceptible, whereas converting a fixed FP4 format to FP16 is just wires and a tiny lookup table.
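To make the comparison concrete, here is a small Python sketch of the two decoders side by side. The fixed decoder again assumes the E2M1 layout, which may not match the actual part. CUSTOM_TABLE is a purely hypothetical codebook I picked to favour small values; it is not taken from any real hardware or model.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def decode_fixed(code):
    """Fixed-format decode: pure bit slicing (assumed E2M1 layout, bias 1)."""
    sign = -1.0 if code & 0x8 else 1.0
    exp, man = (code >> 1) & 0x3, code & 0x1
    mag = 0.5 * man if exp == 0 else (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

def decode_lut(code, table):
    """Programmable decode: the 4 bits index a table of 16 FP16 values."""
    return table[code]

# The fixed format is just one particular choice of table.
FIXED_TABLE = [decode_fixed(c) for c in range(16)]

# Hypothetical codebook concentrated near zero, rounded to FP16.
CUSTOM_TABLE = [to_fp16(v) for v in (
    -1.0, -0.7, -0.5, -0.35, -0.25, -0.15, -0.1, -0.05,
    0.0, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.7,
)]

def quantize(x, table):
    """Pick the code whose table entry is closest to x."""
    return min(range(16), key=lambda c: abs(table[c] - x))

def quant_error(x, table):
    """Absolute error after a quantize/decode round trip."""
    return abs(decode_lut(quantize(x, table), table) - x)

err_fixed = quant_error(0.1, FIXED_TABLE)    # nearest fixed value is 0.0
err_custom = quant_error(0.1, CUSTOM_TABLE)  # table holds ~0.1 directly
```

The table version can reproduce the fixed format, but not the other way round. In hardware the table lookup means selecting one of 16 values for each of the 16 output bits in every lane, which is where the multiplexers come from, while the fixed decode only routes and pads a handful of bits.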