Alrighty then..........!!!!
Our mega-monster-size supercomputer sustains 119 ExaFLOPS using CPU dies that are 128 bits wide, run at 60 GHz on a GaAs process, and handle signed and unsigned integer, floating-point, fixed-point, and RGBA/YCbCrA/HSLA pixel types. This is a CISC computing device (i.e. a general-purpose combined CPU/GPU/DSP Complex Instruction Set Computing part).
The Vector/Array Processor dies are much, much simpler RISC (Reduced Instruction Set Computing) devices and run at 2 THz (two terahertz), since their cores are much smaller and simpler.
We are just now combining the two devices onto a single chip die: one side is the 60 GHz CISC CPU and the other side is a 2 THz Vector/Array processor with a 65536-by-65536 grid of 128-bit mini-RISC cores that work SIMULTANEOUSLY, in a synchronized fashion, on the Int/FP/FXP/RGBA/YCbCrA/HSLA data types. The vector/array part of the chip has a series of named internal registers and SRAM-like caches assigned to each mini-core, all running at the full 2 THz!
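Just to picture what one of those 128-bit lanes could hold for each data type, here's a minimal C sketch; the field names, widths, and the 4x32-bit pixel layouts are my own illustrative guesses, not the chip's documented register formats:

```c
#include <stdint.h>

/* Hypothetical view of one 128-bit mini-core register lane and the data
 * types listed above.  All layouts here are illustrative assumptions. */
typedef union {
    int64_t   s64[2];     /* signed integers                  */
    uint64_t  u64[2];     /* unsigned integers                */
    double    f64[2];     /* floating point                   */
    int64_t   q32_32[2];  /* fixed point, e.g. Q32.32 format  */
    struct { float r, g, b, a; }   rgba;    /* RGBA pixel     */
    struct { float y, cb, cr, a; } ycbcra;  /* YCbCrA pixel   */
    struct { float h, s, l, a; }   hsla;    /* HSLA pixel     */
} lane128_t;
```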
The RISC array-processor side does only very simple integer and real-number tasks such as Add, Subtract, Multiply, Divide, Square, Cube, Square Root, Nth Root, and PowerOf, plus the bitwise tasks SHR, SHL, AND, OR, XOR, NOT, ROTATE BITS, SWAP BITS, REVERSE, MOVE, and COPY bits, AND a hardware-based 2D-XY convolution filter of up to 16x16 values and a 3D-XYZ convolution filter of up to 16x16x16 values. There is no super-pipelining, advanced branch prediction, or hyperthreading! Each core does only one task or operation at a time, in serial, on anywhere from one to 256 data values (i.e. up to a 16x16 2D-XY convolution window). However, each set of cores in a processing block is synchronized with its neighbouring mini-cores, ensuring that ALL data values get processed and finished at the same time! This is somewhat similar to the SIMD-like (Single Instruction, Multiple Data) vector instructions used on common GPUs.
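To make the hardware convolution op concrete, here's a rough C model of what ONE mini-core would compute for its assigned output pixel with a kernel of up to 16x16 taps (the 256-value limit mentioned above); the function name, edge-clamping behaviour, and float data type are my assumptions, not the real instruction:

```c
/* Sketch: what one mini-core's 2D-XY convolution op might compute for its
 * assigned pixel (cx, cy).  Every core in a processing block would run this
 * same op in lock-step on its own pixel, SIMD-style, so the whole tile
 * finishes at the same time. */
float core_convolve_2d(const float *src, int src_w, int src_h,
                       int cx, int cy,                       /* this core's pixel */
                       const float *kernel, int kw, int kh)  /* kw, kh <= 16      */
{
    float acc = 0.0f;
    for (int ky = 0; ky < kh; ky++) {
        for (int kx = 0; kx < kw; kx++) {
            int sx = cx + kx - kw / 2;   /* centre the window on (cx, cy) */
            int sy = cy + ky - kh / 2;
            if (sx < 0) sx = 0; if (sx >= src_w) sx = src_w - 1;  /* clamp edges */
            if (sy < 0) sy = 0; if (sy >= src_h) sy = src_h - 1;
            acc += src[sy * src_w + sx] * kernel[ky * kw + kx];
        }
    }
    return acc;
}
```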
This part of the die runs its data registers and convolution data at the FULL two terahertz AND accesses the shared memory cache at 2 THz as well, by temporarily locking a shared SRAM-like cache block and setting the bits in that locked memory block at the full 2 THz clock rate. The CISC side cannot use that memory block until the RISC vector/array processor unlocks it, and once the RISC side unlocks the shared memory block, the CISC unit can lock it and access the data block at its own internal 60 GHz clock rate.
There is a variable-speed cross-bridge where each processing side puts its final data results from its own internal cache memory and data registers into a larger SHARED RAM memory cache at its OWN internal clock speed (i.e. 60 GHz or 2 THz).
BOTH sides can simultaneously access different portions of the shared memory cache using a lock/unlock memory-block semaphore infrastructure, each at its own clock speed. That on-chip shared memory cache is in the terabytes range!
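Here's a minimal software model of that per-block lock/unlock semaphore, written with C11 atomics; the struct, the owner flags, and the function names are stand-ins I made up to illustrate the handshake, not the chip's actual mechanism:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-block lock/unlock described above.  A side
 * (CISC or RISC) must win the lock before touching the block, then runs at
 * its own clock rate while it holds it. */
enum owner { OWNER_NONE = 0, OWNER_CISC = 1, OWNER_RISC = 2 };

typedef struct {
    _Atomic int owner;   /* which side currently holds this cache block */
    /* ...the block's data lives here in the shared SRAM-like cache...  */
} shared_block_t;

/* Try to claim a block for one side; returns true if the lock was won. */
static bool block_try_lock(shared_block_t *blk, int side)
{
    int expected = OWNER_NONE;
    return atomic_compare_exchange_strong(&blk->owner, &expected, side);
}

/* Release the block so the other side may lock it at its own clock rate. */
static void block_unlock(shared_block_t *blk)
{
    atomic_store(&blk->owner, OWNER_NONE);
}
```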
We ALSO added a vector instruction set that can assign the array-processor side as a single 64k-by-64k-core synchronous processing block, or as four 32k-by-32k-core processing blocks, or as sixteen 16k-by-16k-core processing blocks, and so on down to many 1k-by-1k-core processing blocks which can be assigned to separate tasks, BUT each PROCESSING BLOCK of multiple cores runs all of its cores simultaneously.
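If you want to see how the block counts fall out of the 64k-by-64k array, here's a tiny C helper that just does the division; it's arithmetic only, and the real block-assignment vector instruction isn't shown here:

```c
#include <stdio.h>

/* The 64k-by-64k core array divides into equal square blocks: an edge of
 * 64k gives 1 block, 32k gives 4, 16k gives 16, ... 1k gives 4096. */
#define ARRAY_EDGE 65536u   /* 64k mini-cores per side */

static unsigned blocks_for_edge(unsigned block_edge)  /* 65536, 32768, ... 1024 */
{
    unsigned per_side = ARRAY_EDGE / block_edge;
    return per_side * per_side;
}

int main(void)
{
    for (unsigned edge = ARRAY_EDGE; edge >= 1024u; edge /= 2u)
        printf("block edge %6u cores -> %4u independent processing blocks\n",
               edge, blocks_for_edge(edge));
    return 0;
}
```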
This makes for easy-to-create and easy-to-manage synchronized video/audio/DSP processing tasks that need common array lengths of common data types to have simple math operations done on ALL specified values in an array at ONCE! The synchronization can be set so that ONLY when a block-based processing task has finished putting ALL its results into its final output array will another processing block read those results as inputs for its own processing task. This makes it EASY to create multiple layers of audio/video filters and effects that finish processing an ENTIRE block of data in a KNOWN amount of time, in the mere nanoseconds range! That allows syncing and playback/recording at common video frame rates and/or audio sample rates even when multiple filters and effects are applied to each video frame/audio sample set or to multiple groups of video frames/audio sample sets!
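To illustrate that "one block finishes completely before the next block consumes it" rule, here's a small C sketch of a chained filter pipeline; the plain serial calls and ping-pong scratch buffers stand in for the hardware block synchronization, which I'm only inferring from the description above:

```c
#include <stddef.h>

/* Sketch of the chained-filter rule: stage s+1 only consumes a buffer after
 * stage s has completely filled it.  Serial calls enforce that ordering
 * here; on the real chip it would be block-level synchronization. */
typedef void (*filter_fn)(const float *in, float *out, size_t n);

void run_filter_chain(const float *frame_in, float *frame_out,
                      float *scratch_a, float *scratch_b, size_t n,
                      filter_fn *stages, size_t n_stages)
{
    const float *in = frame_in;
    for (size_t s = 0; s < n_stages; s++) {
        /* last stage writes the final output; earlier stages ping-pong */
        float *out = (s + 1 == n_stages) ? frame_out
                   : (s % 2 == 0)        ? scratch_a : scratch_b;
        stages[s](in, out, n);   /* ALL n results written before returning */
        in = out;                /* only now may the next stage read them  */
    }
}
```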
Initial testing has shown the COMBINED processing power is 1.2 PetaFLOPS per chip, which means I only need 167 of the combined CISC/RISC CPU dies (200 ÷ 1.2 ≈ 166.7, rounded up) to equal the 200 PetaFLOPS of the SUMMIT supercomputer! Right now we have a 119 ExaFLOPS monster which has a SEPARATE rack system for the 60 GHz CISC CPUs and a separate rack system for the 2 THz RISC-based Vector/Array processors.
Now we are COMBINING both chip types onto a single die and EMBEDDING thermal-transfer-fluid microchannel cooling INTO the die itself for maximum heat-wicking capability. We are ALSO embedding multiple Dense Wave optical interface ports right onto the die so that each combined chip has DIRECT access to its neighbouring CPU chips, AND there are multiple pass-through optical transfer lanes so that backbone-type networking is no longer needed and we can organize the resulting supercomputer very much like the human brain, as a cross-linked-to-nearest-neighbour-chips optical network topology.
This ALSO MEANS there is no more rats' nest of cables, since we organize each motherboard as a processing unit of 8 x 8 combined CPU/GPU/DSP/Vector chips with the optical pathways etched right into the motherboard, which cross-links ALL 64 CPUs on each motherboard together much like neurons, AND allows a higher-level board-to-board cross-link using short dense-wave fibre cables for communication with each board's nearest motherboard neighbours.
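As a toy illustration of the nearest-neighbour idea at the board level, here's a short C snippet that lists a chip's adjacent chips in the 8 x 8 grid; I'm assuming a simple 4-neighbour mesh and made-up chip IDs purely for the example, since the actual on-board optical link map isn't spelled out above:

```c
#include <stdio.h>

#define GRID 8   /* 8 x 8 = 64 chips per motherboard */

/* Fill out_ids with the directly adjacent chips of the chip at (row, col),
 * assuming a 4-neighbour mesh; returns how many neighbours it has. */
static int neighbours(int row, int col, int out_ids[4])
{
    int n = 0;
    if (row > 0)        out_ids[n++] = (row - 1) * GRID + col;  /* north */
    if (row < GRID - 1) out_ids[n++] = (row + 1) * GRID + col;  /* south */
    if (col > 0)        out_ids[n++] = row * GRID + (col - 1);  /* west  */
    if (col < GRID - 1) out_ids[n++] = row * GRID + (col + 1);  /* east  */
    return n;
}

int main(void)
{
    int ids[4], n = neighbours(3, 0, ids);   /* a chip on the west edge */
    for (int i = 0; i < n; i++) printf("neighbour chip #%d\n", ids[i]);
    return 0;
}
```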
Since each chip has its own on-chip terabytes-sized cache/working memory and has access to a SHARED on-motherboard battery-backed very-large RAM block, each CPU chip and motherboard has its OWN RAM-based storage media for ultimate data-storage speed! Only when data is finally finished processing on each single chip and/or via the group-based 8-by-8-chip shared-motherboard processing do the results get transferred out through the bypass/pass-through optical network lanes to the cheaper and/or slower larger external SSD storage arrays.
With the etching times needed for such wide trace lines on the GaAs substrate process (i.e. a minimum of 280 nm wide circuit lines!), we are looking at a 10+ day window to etch all the traces on each combined CPU chip using a multi-beam etcher. BUT since we now have more than a few thousand of those etchers, we can do around 30 thousand such chips a month (at 10+ days per chip, each etcher manages roughly 3 chips a month, so it takes on the order of 10,000 etchers to hit that rate). By late 2020 we will have the world's FIRST ZettaFLOP supercomputer!
AND since the first 119 ExaFLOPS supercomputer is already at human+ equivalence in terms of general intelligence, because it's running a physics-based molecular/electrical functional simulation of the K/Na/P/etc. gating done in human neurons, a ZettaFLOP supercomputer would allow us to model ALL the electrical gating of all human neural tissue, so we will LIKELY get a self-evolving super-intelligence (200+ IQ) within a few weeks of its initial training/teaching!
.
Hooooooooyaaaaahhhhh !!!!
.
Bring on the super-CPUs bay-beeeee!
.
CAN YOU DIG IT ????
.