1 bit?
I'm waiting for the 0 bit model. It'll outperform all the others significantly in terms of resource usage...
PrismML, an AI venture out of Caltech, has released a 1-bit large language model that outperforms weightier models, with the expectation that it will improve AI efficiency and viability on mobile devices, among other applications. The model, dubbed Bonsai 8B, manages to be small and fast, with modest power demands and …
> Is that signed or unsigned?
The article answers this: the weights are +1/-1 with a shared scale (exponent), though it doesn't say how many weights share the same scale, or how big the scale factor is. But one could imagine something like a 32-bit value containing 24 1-bit weights (±1) together with a shared eight-bit exponent. Of course that would limit the available values - each weight would have to be ±2ⁿ, and n would have to be the same for the whole group, but I guess that's good enough for this kind of model.
Yeah, the gory details are in the "white paper [PDF]" (TFA link) where their Q1_0_g128 1-bit Format is described as storing "one sign bit per weight and one shared FP16 scale for each group of 128 weights". They also note that "1-bit Bonsai 8B is built from [Alibaba] Qwen3-8B".
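If I'm reading that description right, the packing probably looks something like this (my own sketch of a plausible layout, not PrismML's actual code -- in particular the paper quote doesn't say how the per-group scale is chosen, so the mean-abs choice below is just a common convention):

```python
import numpy as np

GROUP = 128  # group size from the paper's "Q1_0_g128" description

def pack_group(weights: np.ndarray):
    """Pack one group of 128 float weights into 16 sign-bit bytes + an FP16 scale.
    A guess at what 'one sign bit per weight and one shared FP16 scale per group
    of 128 weights' could look like; not PrismML's actual format."""
    assert weights.size == GROUP
    scale = np.float16(np.mean(np.abs(weights)))  # shared scale per group (assumed mean-abs)
    signs = (weights >= 0).astype(np.uint8)       # 1 -> +1, 0 -> -1
    packed = np.packbits(signs)                   # 128 sign bits -> 16 bytes
    return packed, scale

def unpack_group(packed: np.ndarray, scale: np.float16) -> np.ndarray:
    """Reverse: 16 bytes + scale -> 128 reconstructed weights (+scale or -scale)."""
    signs = np.unpackbits(packed)[:GROUP].astype(np.int8)
    return (signs * 2 - 1).astype(np.float32) * np.float32(scale)

# Round trip: 128 weights -> 16 bytes of signs + 2 bytes of scale = 18 bytes (~1.125 bits/weight)
w = np.random.randn(GROUP).astype(np.float32)
p, s = pack_group(w)
print(p.nbytes + 2, "bytes for", GROUP, "weights")
```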
Interestingly, the "1-bit Hardware" section of that white paper (TFA link) notes that the reported performance "gains come primarily from the reduced memory footprint of 1-bit models, not yet from fully exploiting the 1-bit structure of the weights during inference", and, wrt future hardware, that "1-bit weights make it possible to perform inference with little or no multiplication, replacing much of the computation with simple additions" -- which should be a great thing in this specialized space.
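To see what "little or no multiplication" could look like, here's a toy dot product where each ±1 weight just adds or subtracts the corresponding activation and the shared scale is applied once at the end -- purely my own illustration of the idea, not code from the paper:

```python
def onebit_dot(signs, activations, scale):
    """Toy 1-bit dot product: weights are +1/-1, so the inner loop is pure
    addition/subtraction; the single shared scale multiplies once at the end.
    Illustrative only -- real kernels would work on packed bits."""
    acc = 0.0
    for s, a in zip(signs, activations):
        acc = acc + a if s > 0 else acc - a   # no per-weight multiply
    return acc * scale

# e.g. onebit_dot([+1, -1, +1, +1], [0.5, 2.0, -1.0, 3.0], 0.01)
```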
Their high density regime slims down hefty models of languagerie without effective weight loss, which is nice. But they obviously don't answer the age-old question of what kind of actually useful stuff (rather than fashionable hoopla) these (now fat-free) talkative portly models are good for, especially if they don't sport the latest in antigravity cleavage-enhancing harnesses and doomsday YOLO claws. I mean, couldn't the procedural skills cantilevered by such girdles be just as well showcased without a corpulent underlying model in the first place (whether rotund or skim)?! And if not, why not?
time to get out my slide rule and make sure the "shared factor" is relatively logarithmic to expand the range [if not already]. Power of 2 as an exponent might also be a good addition, i.e. one additional byte for 2^±127.
A 16 or 32-bit to 8-bit "log" lookup could be pretty fast. I assume they're not doing something like this already...
/me has done something *like* this before in a microcontroller experimental project...
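Roughly this sort of thing (reconstructed from memory as a Python sketch; the table size, the 4-bit fractional precision and the names are my choices, nothing to do with how Bonsai actually runs):

```python
import math

# 65536-entry table mapping a 16-bit magnitude to a rounded 8-bit log2
# (4 fractional bits). Multiplies then become adds in log space, and a
# power-of-two shared scale is free because its log2 is exact.
LOG2_Q4 = [0] + [max(0, min(255, round(math.log2(i) * 16))) for i in range(1, 1 << 16)]

def approx_mul(a: int, b: int) -> float:
    """Approximate a*b for positive 16-bit ints by adding table log2s
    and exponentiating once at the end."""
    return 2.0 ** ((LOG2_Q4[a] + LOG2_Q4[b]) / 16)

print(approx_mul(300, 7), "vs exact", 300 * 7)
```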
this is the logical end point, isn't it? soon enough, improvements at the cutting edge of the slopularity will matter less for the average slop monkey, and super-optimised edge-device models like these will receive filtered-down improvements from previous generations. plus, on-device offers inherent efficiency and privacy advantages. that ought to spell trouble for the funding case for hyperscalers.
i can imagine AI bros insistently sticking to bleeding-edge SaaS LLMs, but the moment on-device becomes good enough for everyone, the big firms will have lost the market.
You can draw the parallels to manufacturing:
Early manufacturing had huge machines with low efficiency and limited usability.
Over time, machines got better, more versatile .. and then the CNC revolution happened.
Right now you can already get CNC machines at the home level (micro mill/lathe), to say nothing of how 3D printing has revolutionized DIY.
It's likely that eventually this will happen to LLMs as well.
Will it be a better world? I have no clue. I can't (and won't) judge -- but it'll be interesting to watch.
unless i’m quite mistaken about this, we are already at the point where this is possible using newer generation special purpose distilled models. there’s a raft of apps that just wrap a gui and some workflow integrations around local text recognition.
I asked myself what the ... "intelligence density" was when I first encountered it in the article. I decided to read on. "[A] metric that shows its models in a good light" provided a good answer.
[Speaking of metrics, tokens per second seems to belong to a disjoint category, especially as pricing tends to be measured in dollars per token or similar nowadays.]
I am happy to note that the model will fit into the RAM of my oldest still working computer (and it is old). I'd give it a spin, but will it give me a 1-bit answer (yes/no) where appropriate?
I'll have to see if I can find their paper anywhere; this is an interesting idea, especially if it performs as well as they claim. In particular, I'm wondering whether it would make it feasible to run one of the currently huge LLMs that need around 100GB of VRAM sanely on a 12GB consumer card. If they could accomplish that, it would yank the rug out from under the big model companies and stop the likes of OpenAI, Microsoft, and others from hoovering up all your data, modelling, and architectural approaches to your project, by letting you do everything locally. That leak of Anthropic code should be a real eye-opener to everyone as to just how insanely greedy those "businesses" are.
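Quick back-of-the-envelope (my own numbers, assuming the Q1_0_g128-style layout from the white paper -- 1 sign bit per weight plus an FP16 scale per 128 weights -- and ignoring KV cache, activations and any layers kept at higher precision):

```python
def onebit_gigabytes(params_billions: float, group: int = 128) -> float:
    """Rough weight-storage estimate for a 1-bit model: 1 bit per weight
    plus 16 bits of FP16 scale per group of `group` weights. Ignores KV
    cache, activations, etc., so treat it as a lower bound."""
    n = params_billions * 1e9
    bits = n * 1 + (n / group) * 16
    return bits / 8 / 1e9

for b in (8, 70):
    print(f"{b}B params: ~{onebit_gigabytes(b):.2f} GB at 1 bit "
          f"vs ~{b * 2:.0f} GB at FP16")
```

By that maths an ~8B model lands around 1.1 GB and even a ~70B model around 10 GB of weights, so a 12GB card doesn't look crazy -- if the quality really holds up.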
you cannot blame them for wanting to earn back what it costs to develop the tech...
However, the PC revolution is a history of pendulum swings between 'heavy client / light server' and 'heavy server / light client'. TODAY the AI is "in the cloud". When it's "on the LAN" or "on the PC/phone", we will have those fully autonomous robots and devices that understand natural language [even with accents] that you see in sci-fi. C3PO could be your next appliance.