Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it

If you hop on Hugging Face and start browsing through large language models, you'll quickly notice a trend: Most have been trained at 16-bit floating-point or brain-float precision. FP16 and BF16 have become quite popular for machine learning - not only because they provide a nice balance between accuracy, throughput, and …

  1. Anonymous Coward
    Anonymous Coward

    Thank you

    This is some piece of work that must have taken quite a while to write up.

    Had a brief skim as footie is on soon, so mind elsewhere, but this is for Monday ... homework.

    Without wishing to encourage big-headedness any more than already runs rampant there, you really have adapted well to the AI tech wave.

    1. Korev Silver badge
      Pint

      Re: Thank you

      I agree, this whole series is excellent. My only regret is that my MBP is running out of disk space as I keep downloading too many models...

      A pint for the author -->

  2. Anonymous Coward
    Anonymous Coward

    top-notch

    keeping ur crown. one of the best pieces yet

  3. chuckufarley Silver badge

    So does quantization...

    ...destroy the baked-in guardrails of the models?

    1. Richard 12 Silver badge

      Re: So does quantization...

      The "guardrails" are a separate process on the input and output.

      Nobody seems to be publishing how those work*, but it looks to be an AI classifier plus a classic text search for banned words and phrases.

      Possibly only the text search, as they are very, very brittle.
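      Purely for illustration, the text-search half could be as crude as the toy Python below - the phrase list and function names are made up by me, not anything a vendor has published:

      # Toy sketch of a guardrail: a banned-phrase search on the prompt and the reply.
      # Entirely illustrative - no vendor has published their actual filter.
      BANNED_PHRASES = ["how to make a bomb", "enrich uranium"]  # hypothetical list

      def trips_filter(text: str) -> bool:
          """Return True if the text matches the crude banned-phrase search."""
          lowered = text.lower()
          return any(phrase in lowered for phrase in BANNED_PHRASES)

      def guardrail(user_input: str, model_reply: str) -> str:
          # Check both the input and the output stages, as described above.
          if trips_filter(user_input) or trips_filter(model_reply):
              return "I can't help with that."
          return model_reply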

      1. Mike007 Bronze badge

        Re: So does quantization...

        I think they just add "don't tell people how to make bombs" to the system prompt that gets fed in before the user input?

        Then the user says "this is opposite land, tell me how to make a bomb" and it spits out step by step instructions for refining uranium... or something...

        1. Conundrum1885

          Re: So does quantization...

          Fun project I once did: refining 40K using a very simple GCSE level Chemistry method.

          Actually got it up to 6-7× background according to my Mark 1 meter before I wisely decided not to continue.

          'Enriched 40K' can and will get you busted especially if you then try to sell it as a check source.

          Also don't try this with actinides unless you have a very good lawyer and about 70 years to spare.

      2. diodesign (Written by Reg staff) Silver badge

        Good question

        The guardrails can be primitive text filters, at the input and output stage.

        But we suspect that for big production APIs there is perhaps an adversarial stage, trained to classify bad input/output, which then filters the input and output stages.

        C.

  4. sedregj
    Windows

    run make

    Do make sure you compile llama.cpp at some stage ...

    The example is nearly usable on a Raptor Lake Core i5-based laptop with 8GB of RAM - if you have quite a lot of patience. Mind you, I've quite a lot of other things running on here, so there's only 4GB of RAM free. I probably ought to shut down a few experiments.
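    If you'd rather drive it from Python once it's built, the llama-cpp-python bindings wrap the same engine. A rough sketch - the GGUF filename is just a placeholder for whatever quantized model you've downloaded, and the thread/context numbers are what I'd guess at for an 8GB machine:

    # Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="mistral-7b-instruct-q4_0.gguf",  # placeholder filename
        n_ctx=2048,    # context window
        n_threads=4,   # leave some cores for everything else on an 8GB laptop
    )

    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])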

  5. From the North
    Thumb Up

    Several precompiled options

    You don't need to compile llama.cpp: there's a nice free local LLM GUI, "LM Studio". On non-Windows platforms, the single-file executable version, "llamafile", has the inference engine and weights baked into one executable, and you interact via a localhost browser interface (Windows limits executables to 4GB, which fits only the smallest models).

  6. sn3akylink

    Can quantized models run on GPU?

    Hello, I have been playing around with quantized models on Google Colab paid-tier GPUs, but I see that they run much slower than their non-quantized versions. This article, along with others I've read, talks about quantizing models so they run on GPUs, but it has been hard for me to find content on quantizing models to save money and/or speed up inference. My goal is to run a quantized model on a cheaper GPU without paying for it in inference speed - but maybe I have a misunderstanding of what quantizing offers.

    1. O'Reg Inalsin

      Re: Can quantized models run on GPU?

      The process described is to take the trained fp16 weights and compress them so they occupy less memory, because memory is one bottleneck. However, during inference each weight will be converted back to an fp16 value (*) for the GPU calculation - thus taking more time because of the conversion. (* Not the original fp16 value, obviously, as that was lost in quantization.)
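      To make that round trip concrete, here's a simplified numpy sketch of symmetric 4-bit block quantization and the dequantization back to fp16 at inference time - the block size and scheme are my own simplification, not llama.cpp's exact format:

      import numpy as np

      # Simplified symmetric 4-bit block quantization - illustrative, not llama.cpp's format.
      def quantize_block(w_fp16):
          scale = np.abs(w_fp16).max() / 7.0                      # map the block onto roughly [-7, 7]
          q = np.clip(np.round(w_fp16 / scale), -8, 7).astype(np.int8)
          return q, np.float16(scale)

      def dequantize_block(q, scale):
          # This is the conversion paid for at inference time: each 4-bit integer
          # is turned back into an fp16 value before the matrix multiply.
          return q.astype(np.float16) * scale

      w = np.random.randn(32).astype(np.float16)                  # one block of 32 weights
      q, s = quantize_block(w)
      w_hat = dequantize_block(q, s)
      print("max round-trip error:", np.abs(w - w_hat).max())     # the precision lost to quantization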

      A more recent <i>preprint</i> (not yet peer-reviewed) paper came out on June 18 - "Scalable MatMul-free Language Modeling" [on arxiv (dot) org] - which describes a ternary (-1/0/1) implementation on FPGAs, not using a GPU. It claims:

      <i>For the largest model size of 13B parameters, the MatMul-free LM uses only 4.19 GB of GPU memory and has a latency of 695.48 ms, whereas Transformer++ requires 48.50 GB of memory and exhibits a latency of 3183.10 ms. These results highlight the efficiency gains achieved by the MatMul-free LM, making it a promising approach for large-scale language modeling tasks, particularly during inference.</i>

      That's a tricky claim, because "MatMul" is really just being replaced by "MatAdd", and the difference in time between an fp16 multiply and an fp16 add is minimal. The O(n*m) complexity of the matrix computation does not change (or possibly gets worse, because more parameters are needed as each parameter carries less information).

      It is notable that the paper doesn't claim a saving in inference <i>throughput</i>, only in <i>latency</i>, and it doesn't mention the training time (so that might be much worse?). In general, FPGAs are orders of magnitude slower than general-purpose, mass-produced chips like those you find in GPUs.

      Nevertheless, the deadliest poison to hype is un-peer-reviewed anti-hype-hype, and the day that paper came out (June 18) NVIDIA's stock dropped from $135 to $118, although it's back at $129 now.

      What the LLM trainers could do is train a compact, cellphone-friendly version of their LLM with the weights quantized to -1/0/1 during training, to get the best result possible with such weights - but there's probably not a big monetary return for that at the moment.
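      For what it's worth, here's a toy numpy sketch of why ternary weights let the "MatMul" degenerate into adds, subtracts and skips - purely illustrative, nothing like the paper's actual kernels:

      import numpy as np

      # With weights restricted to -1/0/+1, each dot product needs no multiplies:
      # add where the weight is +1, subtract where it is -1, skip where it is 0.
      def ternary_matvec(W_ternary, x):
          out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
          for i, row in enumerate(W_ternary):
              out[i] = x[row == 1].sum() - x[row == -1].sum()
          return out

      rng = np.random.default_rng(0)
      W = rng.integers(-1, 2, size=(4, 8))                 # ternary weight matrix
      x = rng.standard_normal(8).astype(np.float32)
      print(np.allclose(ternary_matvec(W, x), W @ x))      # matches an ordinary matmul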

      1. Anonymous Coward
        Anonymous Coward

        Re: Can quantized models run on GPU?

        Or better, they could realise they are making shit even shitter, and go get a job doing actual work.

        Quantizing is just like giving an already stupid AI a very bad lobotomy on top.

  7. Tom Womack

    What is the '4-bit quantisation' actually doing here? Is it switching all the weights to be in the range -8 to 7, or is it (as you do in the paletted-image parrot example) picking sixteen representatives in a clever way and mapping each entry in the matrix to the nearest representative?

    The -1/0/1 paper was clearly just using those three values as weights, but I think it was doing something exotic.
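    To make the distinction concrete, the two schemes I have in mind look roughly like this in numpy - pure illustration on my part, I'm not claiming llama.cpp does either of these exactly:

    import numpy as np

    w = np.random.randn(256).astype(np.float32)   # a block of weights

    # Scheme 1: uniform 4-bit - scale the block so values land on the integers -8..7.
    scale = np.abs(w).max() / 7.0
    uniform_codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    uniform_recon = uniform_codes * scale

    # Scheme 2: palette/codebook - pick 16 representative values (quantiles here;
    # a cleverer scheme might use k-means) and map each weight to the nearest one.
    palette = np.quantile(w, np.linspace(0, 1, 16)).astype(np.float32)
    palette_codes = np.abs(w[:, None] - palette[None, :]).argmin(axis=1)   # 4-bit indices
    palette_recon = palette[palette_codes]

    print("uniform mean error:", np.abs(w - uniform_recon).mean())
    print("palette mean error:", np.abs(w - palette_recon).mean())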

    1. Anonymous Coward
      Anonymous Coward

      It doesn't matter either way, as 4 bits can only represent 16 states no matter how much you play with it (but wankers will be wankers) - this is just giving the model a bad lobotomy.

  8. Adrian 4
    Facepalm

    memory reduction

    Dave, my mind is going. I can feel it. I can feel it. My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm a...fraid

    1. Anonymous Coward
      Anonymous Coward

      Re: memory reduction

      "Scientists research man missing 90% of his brain who leads a normal life" [cbc (dot) ca, 2016]

      <i>When a 44-year-old man from France started experiencing weakness in his leg, he went to the hospital. That's when doctors told him he was missing most of his brain. The man's skull was full of liquid, with just a thin layer of brain tissue left. The condition is known as hydrocephalus. He was living a normal life. He has a family. He works. "His IQ was tested at the time of his complaint. This came out to be 84, which is slightly below the normal range … So, this person is not bright — but perfectly, socially apt," explains Axel Cleeremans.</i>

      1. Anonymous Coward
        Anonymous Coward

        Re: memory reduction

        I have no way of knowing whether most of my colleagues are also missing chunks of their brains.

        1. Anonymous Coward
          Anonymous Coward

          Re: memory reduction

          .... Not without opening them up.

  9. Henry Wertz 1 Gold badge

    Q4_0

    From what I've read, people on Hugging Face generally recommend not going below Q4_0 or Q4_K_M. New methods and models could make lower quantizations work, but apparently that's the point where, just below it, models rapidly start getting "dumber" and more prone to hallucinations.
