Re: Can quantized models run on GPU?
The process described is to take the trained fp16 weights and compress them so they occupy less memory, since memory is one of the bottlenecks. However, during inference each weight gets converted back to an fp16 value (*) for the GPU calculation - which costs extra time for the conversion. (* Not the original fp16 value, obviously, as that was lost in quantization.)
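As a rough illustration (a toy per-tensor int8 scheme, not the exact method any particular library uses), the round trip looks like this - the dequantize step is the extra work done at inference time, and the residual error shows why the original fp16 values can't be recovered:

<pre><code>
import numpy as np

def quantize_int8(w_fp16):
    # Per-tensor symmetric quantization: fp16 weights become int8 values plus one scale.
    scale = float(np.abs(w_fp16).max()) / 127.0
    q = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_fp16(q, scale):
    # The conversion back to fp16 that happens (explicitly or in a fused kernel)
    # before the GPU does its usual fp16 math.
    return q.astype(np.float16) * np.float16(scale)

w = np.random.randn(4, 4).astype(np.float16)
q, s = quantize_int8(w)
w_hat = dequantize_fp16(q, s)
print(np.max(np.abs(w - w_hat)))  # nonzero: the original fp16 values are gone
</code></pre>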
A more recent <i>preprint</i> (not yet peer-reviewed) came out on June 18 - "Scalable MatMul-free Language Modeling" [on arxiv (dot) org] - which describes a ternary-weight implementation, including one on FPGAs rather than a GPU. It claims:
<i>For the largest model size of 13B parameters, the MatMul-free LM uses only 4.19 GB of GPU memory and has a latency of 695.48 ms, whereas Transformer++ requires 48.50 GB of memory and exhibits a latency of 3183.10 ms. These results highlight the efficiency gains achieved by the MatMul-free LM, making it a promising approach for large-scale language modeling tasks, particularly during inference.</i>
That's a tricky claim, because the "MatMul" is really just being replaced by something like "MatAdd": with -1/0/1 weights, each multiply becomes an add, a subtract, or a skip, and the difference in time between an fp16 multiply and an fp16 add is minimal. The O(n*m) complexity of the matrix computation does not change (and possibly gets worse, because more parameters are needed when each parameter carries less information).
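To make that concrete, here is a toy sketch (plain Python loops, not the paper's fused kernels) of a dense matrix-vector product next to a ternary-weight version - the multiply disappears, but the doubly nested O(n*m) loop is identical:

<pre><code>
import numpy as np

def dense_matvec(W, x):
    # Standard layer: n*m fp16 multiply-accumulate operations.
    n, m = W.shape
    y = np.zeros(n, dtype=np.float32)
    for i in range(n):
        for j in range(m):
            y[i] += W[i, j] * x[j]
    return y

def ternary_matvec(T, x):
    # "MatMul-free" layer with weights in {-1, 0, +1}: each multiply
    # becomes an add, a subtract, or a skip - but still n*m iterations.
    n, m = T.shape
    y = np.zeros(n, dtype=np.float32)
    for i in range(n):
        for j in range(m):
            if T[i, j] == 1:
                y[i] += x[j]
            elif T[i, j] == -1:
                y[i] -= x[j]
    return y
</code></pre>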
It is also notable that the paper doesn't claim a savings in inference <i>throughput</i>, only in <i>latency</i> and memory, and it doesn't mention training time (so that might be much worse?). And in general, FPGAs are orders of magnitude slower than general-purpose, mass-produced chips like the ones you find in GPUs.
Nevertheless, the deadliest poison for hype is un-peer-reviewed anti-hype hype, and the day that paper came out (June 18) NVIDIA's stock dropped from $135 to $118, although it's back at $129 now.
What the LLM trainers could do is train a compact, cellphone-friendly version of their LLM with the weights quantized to -1/0/1 during training, to get the best result possible with such weights - but there's probably not a big monetary return on that at the moment.
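For what it's worth, a minimal sketch of what "quantized to -1/0/1 during training" could look like - a BitNet-style layer with a straight-through estimator; the per-tensor scaling and the small epsilon are illustrative assumptions, not the referenced paper's exact recipe:

<pre><code>
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    # Linear layer whose weights are rounded to {-1, 0, +1} in the forward
    # pass, while gradients still flow to the underlying full-precision
    # weights (straight-through estimator), so the model learns to live
    # with the constraint during training.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                           # per-tensor scale (assumption)
        w_ternary = torch.round(w / (scale + 1e-8)).clamp(-1, 1)
        w_q = w + (w_ternary * scale - w).detach()       # straight-through trick
        return x @ w_q.t()
</code></pre>

The point is that the ternary constraint is applied while training, so the weights adapt to it, instead of quantizing a finished fp16 model and accepting whatever accuracy is left over.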