GTX 1060 6GB
I didn't know benchmark software could snicker. I guess we do have AI after all.
If you want to scale a large language model (LLM) to a few thousand users, you might think a beefy enterprise GPU is a hard requirement. However, at least according to Backprop, all you actually need is a four-year-old graphics card. In a recent post, the Estonian GPU cloud startup demonstrated how a single Nvidia RTX 3090, …
The Arc A770, which some folks seem to have at hand for much-enjoyed hands-on AI shenanigans (tutorials), looks to have performance roughly comparable to the 3090 (say 39 vs 36 TFLOPS in FP16) ... and maybe a slightly lower price point. Inquiring minds might relish seeing an upcoming "tuto"/PoC where Llama 3.1-8B is run at 1-10 concurrent requests on this Arc (and compared to the Estonian plot for RTX 3090 world domination), imho.
I guess part of the secret sauce here may be vLLM's use of continuous batching (up to 23x throughput improvement) ... a technique that likely inspired Nvidia's in-flight batching on the H100 (if I read that correctly).
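For the curious, here's a minimal sketch of what that looks like from the user's side with vLLM's offline API; the model id and sampling settings are just assumptions for illustration, and the continuous batching itself happens inside vLLM's scheduler, which interleaves requests at the token level instead of waiting for a whole batch to finish.

```python
# Minimal vLLM sketch: submit several prompts at once; vLLM's scheduler
# continuously batches them at the token level (no per-request GPU idle time).
# Model id and sampling parameters below are assumptions, not from the article.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "Why does an RTX 3090 have enough VRAM for an 8B model in FP16?",
    "Name one difference between FP16 and BF16.",
]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() takes the whole list; scheduling/batching is handled internally.
for output in llm.generate(prompts, params):
    print(output.prompt)
    print(output.outputs[0].text)
```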
Size the app to keep it all in main memory.
As good a piece of advice today as it was when designing for a Cray 1.
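As a rough sanity check on the "keep it all in memory" point, here's a back-of-envelope sketch; the layer/head figures and FP16 byte widths are assumptions based on the published Llama 3.1-8B architecture, not anything from the article.

```python
# Back-of-envelope VRAM sizing for Llama 3.1-8B in FP16 on a 24 GB card.
# Architecture figures below (32 layers, 8 KV heads, head dim 128) are
# assumptions taken from the published model config; treat as approximate.
GiB = 1024**3

params = 8.03e9            # ~8B parameters
weight_bytes = params * 2  # FP16 = 2 bytes per parameter

layers, kv_heads, head_dim = 32, 8, 128
# KV cache per token: K and V, per layer, per KV head, FP16.
kv_per_token = layers * 2 * kv_heads * head_dim * 2  # bytes

vram = 24 * GiB            # RTX 3090
leftover = vram - weight_bytes
print(f"weights:        {weight_bytes / GiB:.1f} GiB")
print(f"KV per token:   {kv_per_token / 1024:.0f} KiB")
print(f"KV tokens left: {leftover / kv_per_token:,.0f} (before activations/overhead)")
```

So the weights alone fit comfortably, leaving roughly 9-10 GiB for KV cache, which is what lets a batching server keep many concurrent sequences resident.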
And let's just take a moment to consider that 142 teraFLOPS (albeit at 16-bit precision) is not even state of the art in 2024.
My instinct is that the real trick here is devising tools that can analyse numeric source code and map it onto the most precision-preserving representations.
For example, pi is approximated by 22/7 (error ~4×10^-4) but the lesser-known 355/113 (error ~8×10^-8) has an error roughly 5,000x smaller for one extra digit on top and two on the bottom. Which, as a layman, I think is pretty impressive, but the real trick is an algorithm that can be applied to any calculation to find those approximations.
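For what it's worth, the classical algorithm that produces exactly those two fractions is the continued-fraction expansion: its convergents are the best rational approximations for their size. A minimal sketch (plain double precision, so only the first few convergents are trustworthy):

```python
from fractions import Fraction
from math import pi

def convergents(x, n=5):
    """Return the first n continued-fraction convergents of x."""
    convs = []
    a = x
    # Standard recurrence: h_i = a_i*h_{i-1} + h_{i-2}, same for k.
    h_prev, h = 1, int(a)
    k_prev, k = 0, 1
    convs.append(Fraction(h, k))
    for _ in range(n - 1):
        frac = a - int(a)
        if frac == 0:          # x was exactly rational at this depth
            break
        a = 1 / frac
        ai = int(a)
        h_prev, h = h, ai * h + h_prev
        k_prev, k = k, ai * k + k_prev
        convs.append(Fraction(h, k))
    return convs

for c in convergents(pi):
    print(c, f"rel. error {abs(c - pi) / pi:.1e}")
```

This prints 3, 22/7, 333/106, 355/113, ... with exactly the relative errors quoted above; Python's `Fraction(pi).limit_denominator(113)` does the same search in one call and returns 355/113.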
I'll leave it there.
Them Lunar-ticks are crazy accurate for this, just a mad bin and loon away from the obsessive-compulsive Mars-Eniac standard, both major improvements over the torturously thin Neptune tood-le, pegged through Uranus with super-positry roulette timing (or so I'm told ... not an expert).
It's key tech to benchmark computational astronaut jobs ... amazing that these flops work at all ... can't wait for the coming of age of the Zitty-scale (after the earthy terra-, yummy pita-, and 6-sided hexa-scales)!