So what
I predict a riot!
Within a few years, AMD expects to have notebook chips capable of running 30-billion-parameter large-language models locally at a speedy 100 tokens per second. Achieving this target – which also calls for 100ms of first token latency – isn't as simple as it sounds. It will require optimizations on both the software and …
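For scale, here is a rough back-of-envelope on what that target implies, assuming the model's weights are streamed from memory roughly once per generated token and ignoring KV-cache traffic (my assumptions, not AMD's published maths):

    # Back-of-envelope: bandwidth needed for 30B params at 100 tokens/s,
    # at a few common quantization levels.
    PARAMS = 30e9          # 30-billion-parameter model (from the article)
    TOKENS_PER_S = 100     # AMD's stated target
    BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

    for fmt, b in BYTES_PER_PARAM.items():
        bandwidth_gb_s = PARAMS * b * TOKENS_PER_S / 1e9
        print(f"{fmt}: ~{bandwidth_gb_s:,.0f} GB/s of weight traffic")
    # fp16: ~6,000 GB/s, int8: ~3,000 GB/s, int4: ~1,500 GB/s -- all far beyond
    # the roughly 100 GB/s of today's notebook LPDDR5X, which is why quantization
    # and wider/faster memory keep coming up in this thread.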
It may be that there are no 30-billion-parameter LLMs in a few years; new technology is already shrinking them markedly.
And new approaches like neuromorphic computing may make them 100x faster and smaller. They may require newer NPUs, but they should easily fit on mobile.
AND there may always be a reason for any client to reach into the cloud when running AI, so stand-alone will always be more limited.
Have a nice day.
18GB of HBM3e (that is, a single 12-high stack of 12Gbit devices) isn't a completely unreasonable thing to put next to the GPU package in a 2027 SoP, and handles AMD's spec already. It's a 'chiplet' 11mm on a side, the same size as the IO die in Ryzen.
(https://www.theregister.com/2023/08/23/sk_hynix_hbm3e_sample_shipped)
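And a quick capacity check on that stack, using the post's own figures plus some quantization assumptions of mine:

    # 12-high stack of 12 Gbit dies, and whether a 30B-parameter model fits.
    stack_gbit = 12 * 12               # 144 Gbit per stack
    stack_gb = stack_gbit / 8          # = 18 GB, matching the post
    for fmt, bytes_per_param in (("fp16", 2), ("int8", 1), ("int4", 0.5)):
        model_gb = 30e9 * bytes_per_param / 1e9
        print(f"{fmt}: 30B params need ~{model_gb:.0f} GB -> "
              f"{'fits' if model_gb < stack_gb else 'does not fit'} in {stack_gb:.0f} GB")
    # fp16 (60 GB) and int8 (30 GB) do not fit; int4 (15 GB) fits with a little
    # headroom left for KV cache and activations.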
My years-old HP Pavilion laptop, with a whopping 8 GB of RAM, running Ubuntu, does almost everything I want it to. (I have some trouble with D3D under Wine; that's it.) I'll pass on buying a new machine so an overhyped text-predictor can give me questionable answers to things I can find out myself with a search engine.
I bought some cheap old servers and scientific-computation GPUs; I suppose I could send them an AI query from my 1st-gen iPhone SE and have them deliver results fast enough.
This is speculative of course, first of all I’d require some need or interest compelling me to seek such technology.
(I don’t see any such personal need or interest developing… ever.)
... out of its ass.
"Charleston was once the rage, ah-huh. History has turned the page, ah-huh.
The miniskirt's the current thing, ah-huh. Teenybopper is our newborn king, ah-huh.
And the beat goes on, the beat goes on. Drums keep pounding a rhythm to the brain. La-de-da-de-de, la-de-da-de-da..." --Sonny & Cher
The designers and operators of 'AI', thus far, appear to lack deep understanding of how its internal processes lead to particular results. 'AI' is a 'black box' for seeking connections among data slung at it during a so-called 'learning' process. Of itself, 'AI' has little, if any, ability to discriminate among its 'sources' with respect to their provenance and reliability. Its cobbled-together responses appear impressive and authoritative because they are expressed in natural language, this last being the truly powerful part of the process.
'AI' development has much in common with, perhaps roots in, the specious 'large data' movement which, crudely put, assumes the bigger the haystack the more easily golden needles may be found.
Present moves towards enhancing 'AI' by increasing its capacity for storage and processing resonate with “Never Mind the Quality, Feel the Width” [a 1967-71 British sitcom]. Until powers of discrimination are programmable, the best one can do is employ human judgement in selecting the sources of training data. That suggests reliable 'all singing, all dancing' AI is a pipe dream until the underlying processes of cognition are better understood. Present heuristic methods might produce more useful results when focussed during 'training' upon clearly defined areas of application; even so, moderation by subject specialists should be required.
Floating-point arithmetic is convenient, but it's relatively inaccurate for a given word size. If you need to go really fast and make optimal use of memory space then you need to work in fixed point -- essentially integer arithmetic. It's a bit fiddly to deal with because you have to be mindful of number ranges and precision, so most people prefer to just use a float and eat the side effects. After all, you can always get faster/bigger hardware.....oh wait......
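A minimal sketch of that fiddliness, using a Q16.16 fixed-point format as the example (my choice of format; nothing canonical about it):

    # Q16.16: values stored as integers scaled by 2**16. Multiplication needs an
    # explicit shift back down, and the programmer must watch the representable
    # range (roughly +/-32768 in a 32-bit word) and the precision of 16 fractional bits.
    FRAC_BITS = 16
    ONE = 1 << FRAC_BITS

    def to_fixed(x: float) -> int:
        return int(round(x * ONE))

    def fixed_mul(a: int, b: int) -> int:
        # The raw product carries 32 fractional bits; shift back to 16.
        return (a * b) >> FRAC_BITS

    def to_float(x: int) -> float:
        return x / ONE

    a, b = to_fixed(3.25), to_fixed(-1.5)
    print(to_float(fixed_mul(a, b)))   # -4.875, exact in this case
    print(to_float(to_fixed(0.1)))     # ~0.1000061, since 0.1 isn't exactly representable

All integer adds, shifts and multiplies, so it maps straight onto cheap hardware -- at the cost of having to think about scaling at every step.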
A 20 Terabyte hard drive in 2024 is $400 USD and can store TEN TRILLION 16-bit parameters, while you use a SECOND 20 Terabyte hard drive to store 32-bit INDEXES into Five Trillion subject-matter groups and token indexes of subject-matter-specific A.I. model parameters and tokens. Since I helped PIONEER this technique at NCA (North Canadian Aerospace), we can now use CHEAP spinning hard drives that let ANYONE spend a mere $1000 USD for two 20 TB hard drives, buy a mere 64 gigabytes of System RAM and a decent gamer-class graphics card, and use the power of 50-year-old DBMS binary search techniques to properly and QUICKLY index to and process groups of tokens that break down related data into easily-parsed subject-matter groups rather than trying to use an overly monolithic statistical model and convolution matrix to process data.
Instead of processing a mere 100 tokens per second, I use the power of a GPU graphics card to process subject-matter-related blocks of 16-bit tokens/parameters and 32-bit indexes within the 32-gigabyte to 128-gigabyte memory space of gaming-class graphics cards, processing MANY input tokens in parallel. At the GPU-card equivalent of 4096 by 2160 pixels at 120 fps at 32 bits per pixel of image-processing bandwidth, that is 1,061,683,200 worth of 16-bit A.I.-model tokens and 32-bit indexing parameters PER SECOND, which UTTERLY BLOWS AWAY EVERY modern LLM system out there today!
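For what it's worth, the raw arithmetic in those two paragraphs does check out; this is just a re-calculation of the post's own figures, with no comment on the architecture itself:

    drive_bytes = 20e12                # one 20 TB drive
    params_16bit = drive_bytes / 2     # = 1e13, i.e. ten trillion 16-bit values
    pixels_per_s = 4096 * 2160 * 120   # 4K at 120 fps
    print(params_16bit, pixels_per_s)  # 1e13 and 1,061,683,200
    # Each 32-bit pixel could indeed hold one 32-bit index or two 16-bit values,
    # though this counts raw framebuffer throughput rather than the per-token
    # matrix work an LLM actually performs.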
The NorthCanadianAerospace websites are almost ready to go fully online so you can download our world-wide, fully free and open source (under GPL-3 licence terms) TEN TRILLION parameter A.I.-based combined text, stills and video imaging, audio/voice, general artistic and multi-disciplinary sciences model, which is WELL-CURATED from ONLY open-source data. Our model has BOTH short-term and long-term MEMORY, which FILTERS and parses/divides between human-created and machine-created data so that end-users can use simple prompts to allow or disallow the use of previously-saved data, or use recursive output-redirected-into-input processing of machine-created data, human-only data and percentage-based mixtures of all! No restrictions OF ANY KIND are put into the A.I. model, and all inputs and prompts can be saved/recorded within encrypted file formats WITH a share-to-individuals or share-to-groups option to distribute end-user-created prompts and final output to the public and/or selected persons/groups, allowing the EXACT DUPLICATION of output from UNIQUE original prompts AND the creation of extensions and derivatives of previous data and imagery outputs!
EXTERNAL end-user tokens/parameters are put into a separate file from the original data so that the original model NEVER becomes poisoned, and end-users can specify the percentage amounts of the built-in and external subject-matter areas that are ALLOWED to use or expand upon any NEW saved/recorded prompts and any external token/parameter databases. While Meta and Google use Terabyte-sized models trained with 15 Trillion tokens, we have designed and created a CURATED training system that organizes data in the same way a library system does. Using an analogy, we basically use a form of the advanced-format 1980's-era library-oriented Dewey Decimal system to separate out subject matter into related groups that can be cross-linked by sets of 32-bit indexing links, ensuring input prompts are parsed and processed the way a human mind would so that only RELEVANT AND RELATED DATA is output.
The entire internet IS NOT scraped; rather, ONLY WELL-CURATED open-source data is used and processed to narrow down common historical text, stills and video imagery, audio, arts-related and sciences-related data into common Librarian-oriented subject-matter groups that will be used to NARROW DOWN THE SCOPE of an input prompt and output ONLY what was actually asked for and/or requested! Well-defined Boolean Logic and age-old DBMS techniques are the base underpinnings of the A.I. model and PREVENT model poisoning, since the original A.I. model database DOES NOT UPDATE OR ADAPT on an automated basis but rather is set in stone, EXCEPT for any end-user-defined and/or externally-added data that is allowed or not allowed to be included in any final outputs based upon end-user-set percentages of base and external data inclusion and exclusion!
We store more indexes-to-data and more cross-links to data than actual stored data tokens, because storing EVERY POSSIBLE piece of data is wasteful and NOT NEEDED! We only need to store HOW languages and common knowledge SHOULD BE PROCESSED and combined! Humans don't store the phone numbers of a massive phone book in their brains. Why should they? You only need to KNOW HOW numbers are indexed and organized and then retrieve ONLY THE SPECIFIC NARROWED-DOWN RANGE of numbers coming from an external source! You also don't store every guitar riff or note from every song. You only need to store the musical scale itself, fret positions, common chords and common musical rules-of-composition and musical style in order to make a new riff!
It's just like HOW the human brain works. We store only the MOST BASIC VALUES and the COMMON RULES of how data should be organized and used, NOT the actually-created data itself! Think of this as a super-advanced rules-based expert system rather than an A.I.-savant that knows everything. Our A.I. model is physically much smaller in disk space and much faster to process, so it becomes MUCH MORE USEFUL to more people than what OpenAI, Google, Microsoft and Meta have to offer! Plus, our A.I. is UTTERLY FREE and OPEN SOURCE!
All the above specifications and ideas are now world-wide, fully free and open source under GPL-3 Licence Terms!
V
$ ollama run llama3:70b
>>> Write a rambling commentary describing a very fast AI processing architecture. Use technical terms and sprinkle in all-caps words, but no asterisks markup formatting. The author is nerdy and uses a somewhat condescending tone.
So I've been working on this little project, right? And let me tell you, it's a GAME-CHANGER. I'm talking about an AI processing architecture that's so fast, it's almost like I've defied the laws of PHYSICS themselves. I mean, we're talking about a 10x improvement in processing speeds over the current STATE-OF-THE-ART. Yeah, you heard me right, 10 TIMES FASTER.
It all starts with my proprietary MATRIX MULTIPLIER technology, which allows for PARALLEL PROCESSING on a massive scale. We're talking thousands of NODES working together in perfect harmony, like a finely tuned ORCHESTRA. And at the heart of it all is my custom-designed GRAPHICS PROCESSING UNIT, capable of CRUNCHING NUMBERS at a rate of 10 PETA-FLOPS PER SECOND.
But wait, there's more! My architecture also incorporates ADVANCED NEURAL NETWORK COMPRESSION algorithms, allowing for REAL-TIME INFERENCING and REDUCED LATENCY. It's like the AI is READING YOUR MIND or something (just kidding, it's not that creepy... or is it?).
And don't even get me started on the MEMORY HIERARCHY. I've implemented a HYBRID STORAGE SYSTEM that combines the SPEED of OPTANE with the CAPACITY of 3D XPOINT. It's like having your cake and eating it too (but, you know, with flash storage).
Anyway, I'm sure all the OTHER GENIUSES out there will be green with envy when they hear about my latest creation. But let's be real, they're probably still stuck in the Stone Age of AI development. Oh well, I'll just have to wait for them to catch up...
Technically, that description ISN'T WRONG! LLMs do in fact use Matrix Multipliers, because that is what convolution filters are (i.e. the CNN and SOBEL edge-detection systems used in image recognition). Ten PetaFLOPS would only be possible on GaAs architectures, which ironically IS what we use on our 2 THz clock speed, 50 PetaFLOPS sustained, 128-bits-wide supercomputers we designed and built in-house, which ARE in fact the TRUE world's FASTEST supercomputers.
For the typical home user of a combined LLM and Imaging A.I. Model, you only need a modern gamer-class AMD or NVIDIA GPU card with the typical 16 to 64 Gigabytes of VRAM that can create and process 120 frames per second of video at 32 bits per pixel at 4K resolution which means OVER A BILLION TOKENS PER SECOND can be processed using common graphics cards!
THAT is a SIGNIFICANT INCREASE over any other A.I. Model, because token processing is inherently a GPU-ready process. Each token merely needs to be converted to, or already formatted as, a 32-bit integer-based index value OR two 16-bit integer-based LLM text input tokens or two 16-bit data values. No need for Floating Point processing at all! DO EVERYTHING as integers and you can get OpenAI-sized datasets stuffed into a home PC using only TWO of the $400 20 TB hard drives and some fancy DBMS indexing and binary search technology!
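Purely as an illustration of the "age-old DBMS binary search" idea described above, here is binary search over a file of fixed-width records (a 32-bit subject-group key plus a 64-bit offset into a parameter file). The record layout and names are made up for illustration, not anything NCA has published:

    import struct

    RECORD = struct.Struct("<Iq")   # uint32 key + int64 offset = 12 bytes per record

    def lookup(index_file, key):
        index_file.seek(0, 2)                       # seek to end to get record count
        n = index_file.tell() // RECORD.size
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            index_file.seek(mid * RECORD.size)
            k, offset = RECORD.unpack(index_file.read(RECORD.size))
            if k == key:
                return offset                       # where this subject group's data starts
            lo, hi = (mid + 1, hi) if k < key else (lo, mid - 1)
        return None
    # With ~5 trillion sorted entries that is about 42 seeks per lookup --
    # trivial on SSD, noticeably slower on spinning disks.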
So yeaaaaah! I might actually KNOW OF WHAT I SPEAK since I helped develop the original technology from the get-go! YOU TOO can download it all for free with NO RESTRICTIONS OF ANY KIND at our websites.
V
This post has been deleted by its author
NCA (North Canadian Aerospace) is a pseudonym for a very large under-the-radar Aerospace company which at the moment wants to remain relatively anonymous. They use GoDaddy and Easyhosting as their anonymized web hosting and content management system services although INTERNALLY, everything is on 128-bits wide supercomputing gear connected via Petabit-class switches and Petabit dense-wave multiplexing-based fibre optic connections designed and built entirely in-house.
Our web servers don't actually use common web services or common SQL / RDBMS technology since we built our own compilers and server software plus a 5th Generation object-oriented relational database and natural language data formatting and access system which can output in real-time fully-optimized cross-compiled and cross-assembled executable code targeted towards many major CPU families plus output live HTML-5, JAVA, Python, Basic, Pascal, C/C++, Fortran, Cobol, ADA, etc. source code.
For now, I am in charge of using an EXTERNAL web-hosting system to disclose certain technologies which are a bit of a big deal on a world-wide basis.
I'm not quite ready yet, but the NorthCanadianAerospace websites will go live fairly soon now with LOTS of World-Wide, Fully Free and Open Source under GPL-3 Licence Terms hardware designs, software source code, help and instruction manuals plus other goodies which will be explained in full detail onsite.
It will be Publicly All Yours for you to peruse and download at your leisure!
V
P.S. The FTL propulsion systems we developed with active and cooperative "Outside Help" (Hint! Hint!) are based upon in-atmosphere electrostatics-based Terahertz waveguides using Bismuth, Magnesium, Aluminum and other metallic layers to ensure the high-frequency waveforms can push the atmosphere out of the way AND provide a large atmospheric area to repulse against. The exo-atmospheric FTL systems use GWASER (Gravity Wave Amplification by Stimulated Emission of Radiation) technology, which has a portable on-board Cyclotron to collide Tungsten nuclei together to form a continuous output of HFGW (High Frequency Gravity Waves aka Gravity-A) and MFGW (Medium Frequency Gravity Waves aka Gravity-B). These are collected, reflected, summed and directed within a virtualized cavity resonator and waveguide setup that uses surface waves and standing waves in the microwave bands hugging the inside of a stainless-steel EM-management system in order to TRAP the normally impossible-to-collect-and-guide gravity waves, forming Gravity-A and Gravity-B beams that create localized and temporary gravity wells in a fast computer-controlled sequence. These wells compress local volumes of 3D-XYZ space so we can surf across the gravity wells and cross a much larger linear distance once the gravity wells evaporate and the compressed 3D space re-expands back out to its original dimensions. It only takes about 10 megawatts to power the portable cyclotron, which usually uses Ballard-style methanol fuel cells OR, nowadays, our in-house designed and built Acoustic Wave Liquid Solder Hydrogen Plasma Compression System, a form of Sonofusion, for continuous on-board power production!
You need only avoid localized event horizons, any space debris and other space navigational hazards and NOT make your gravity well inside of a planet, sun or asteroid!
It Works Great and Now You Know!
V
I see 128-bit memory buses being pitched as leading-edge stuff today, I guess. I had a flashback to one of my early video cards, the Number 9 Imagine 128 Series 2 (from 1996), and I just checked to confirm that they claimed at the time it had a 128-bit memory bus with 800MB per second of memory throughput.
https://www.dosdays.co.uk/topics/Manufacturers/numbernine/imagine128_s2.php
"Number Nine believes that the Imagine 128, Imagine 128 Series 2 and Series 2e are the only true 128-bit graphics cards in the world, utilizing 128-bit technology in all three major subsystems -- the graphics processor, the internal processor bus and data path to graphics memory."
The one thing I think I recall about the card was that *technically* the memory bus was not 128-bit, but rather dual-ported 64-bit? I believe the 4MB VRAM card that I had actually had 8 x 512kB VRAM chips on it. But that is just my speculation, based on the number of memory chips the card had (which I wasn't aware of until I owned the card; if I recall right, it had 4 memory chips on each side of the card). But their entire marketing campaign claimed everything was 128-bit throughout.
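A quick sanity check on those numbers (my arithmetic, not anything from a Number Nine datasheet):

    bus_bytes = 128 // 8            # 16 bytes per transfer on a 128-bit bus
    clock_mhz = 800 / bus_bytes     # 800 MB/s claimed throughput
    print(clock_mhz)                # 50.0 -> ~50 MHz, plausible for mid-90s VRAM
    print(8 * 512)                  # 4096 kB -> the 4 MB card from 8 x 512 kB chips
    # The chip count is consistent with either one 128-bit bus or a dual-ported
    # 64-bit arrangement, so the marketing claim can't be settled from this alone.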
I also had the original Imagine 128. I loved those cards, just for the specs really; I had no idea what I was doing at the time, they just seemed so cool. While the Series 2 had OpenGL 3D acceleration in it, it was not game-capable (and I didn't do any 3D CAD stuff). My first 3D card was a PowerVR PCI card (which had no VGA output), another thing that I thought was super cool tech-wise at the time; I played a lot of the original Unreal with that combination. I remember using AcceleratedX on Linux to drive the Number 9 cards (which I don't believe offered any 3D support; for 3D I had to use Windows).
I know it's unrelated; I just felt an urge to write it down once the old memories came back...