Intel challenges AMD's Epycs with a 144 e-core Xeon

With the launch of its many-cored Xeon 6 processors at Computex on Tuesday, Intel is closer to reclaiming the core-count lead over competitors AMD and Ampere. With up to 288 efficiency cores, Intel's Sierra Forest CPUs can boast more hardware threads than the latest chips from either of its main rivals. Unfortunately, that 288 …

  1. CowHorseFrog Silver badge

    With all the chip space devoted to all the pipelining and logic to handle stalls etc, how many modest Raspberry Pi CPUs would this transistor count buy?

    1. StargateSg7 Bronze badge

      Too many manufacturers are using Simultaneous Multithreading (SMT), aka virtual cores, aka hyperthreading, when they should be using single-thread complex cores that contain branches for specific data types such as Floating Point, Fixed Point, Signed/Unsigned Integer, 16-bit UNICODE processing and 8-bit 256-state Boolean State values within each complex core. A hardware flag should signal the data type that needs to be manipulated, which then branches to the data-type-specific processing pipeline and then passes a final result to an application-specific locked/secured cache, or to a multi-application shared global cache, for further processing by downstream applications using the specified cache. This gets rid of pipeline stalls and removes the need to save state between virtualized threads triggered by an interrupt handler.
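
      To make the dispatch idea concrete, here is a rough software model of the type-flag-selects-one-path behaviour (all names are illustrative only, NOT our actual hardware interface):

      /* Rough software model of a hardware data-type flag selecting exactly one
         type-specific execution path per operation -- names are illustrative only. */
      #include <stdio.h>

      typedef enum {                  /* the hardware flag naming the operand's data type */
          DT_FLOATING_POINT,
          DT_FIXED_POINT,
          DT_INTEGER,
          DT_UNICODE16,
          DT_BOOL_STATE
      } data_type_t;

      static void dispatch(data_type_t type)
      {
          switch (type) {             /* only one path is active per operation, so there  */
          case DT_FLOATING_POINT:     /* is nothing to interleave and nothing to stall on */
              puts("floating-point pipeline");  break;
          case DT_FIXED_POINT:
              puts("fixed-point pipeline");     break;
          case DT_INTEGER:
              puts("integer pipeline");         break;
          case DT_UNICODE16:
              puts("UNICODE-16 pipeline");      break;
          case DT_BOOL_STATE:
              puts("Boolean-state pipeline");   break;
          }
      }

      int main(void)
      {
          dispatch(DT_FIXED_POINT);   /* the flag picks the fixed-point path only */
          return 0;
      }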

      In our case at NCA (North Canadian Aerospace), we have an in-house-designed-and-built 1024-core single-thread complex-core SoC super-chip which has the above architecture BUT WITHOUT any fancy pipelining for hyperthreading. We have HUUUUUUGE CACHES for saving all the final results, which can be scheduled to be garbage collected VIA HARDWARE ACCELERATION when the data's expiry time passes (i.e. for time-sensitive applications such as frame-by-frame video processing or sample-by-sample audio and metadata processing).

      We did have versions of the chip that had 8-way hyperthread-like processing, which ended up being way too complex, kept data stalled too many times, and left the buffers always overrun with outdated data. After much design kerfuffle, we went back to single-thread cores but enabled multi-path processing that was dependent on the initial data type to be processed. If the Fixed Point processing path is selected, the Integer processing path is NOT able to be used at the same time, so the core cannot thread-lock or have pipeline stalls, since the results data cache for that core is so deep that we can keep processing until the expiry time passes and that cache location then gets recycled/garbage-collected to allow other data values to be set!

      Each single-thread core can have one million Floating Point, Fixed Point, Integer, Boolean State or UNICODE values up to 128 bits wide stored in its LOCAL final-results cache, and each final result has an associated 16-bit error flag value, a 16-bit current data-processing status flag value, a 16-bit user-defined application-specific status or data-usage flag, a 16-bit user-defined content-type or end-use flag (i.e. what application type and/or end-use is assigned to the specific FP/Fixed Point/Integer/Boolean/Unicode result value -- this can be sub-divided into two 8-bit values or four 4-bit values) and a 64-bit user-defined expiry-time flag that defines the number of picoseconds from the start of processing at which the data value expires (i.e. an 18,446,744-second or 5124-hour or 213-day maximum time limit before a cached individual data value expires and is automatically recycled). If the initial flag of the operation is to place the final result into a global shared cache, the final result is placed into the global shared cache instead of the local cache! If a local-cache or global-shared-cache data value stays unexpired, it is NOT moved until it is retrieved by an application or its expiry time is met. If all one million local-cache or all global-shared-cache final-result values are unexpired and not yet retrieved, a master semaphore is sent upstream to the controlling application indicating that the local cache or global shared cache is full and to clear all or parts of the cache based upon the data-type/content flags, or sorted by the closest expiry date! That is the ONLY TIME the processing pipeline can stall -- i.e. when the local cache or global shared cache is full of unexpired items!
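
      To make the layout concrete, here is roughly how one of those local final-result cache entries would look in C, going purely by the field widths above (the struct and field names are my own illustration, not the real register map), plus a sanity check on the 64-bit picosecond expiry maximum:

      #include <stdint.h>
      #include <stdio.h>

      /* One final-result cache entry: 16 bytes of value + 16 bytes of tags = 32 bytes. */
      typedef struct {
          uint64_t value_lo;      /* 128-bit result value, low half                      */
          uint64_t value_hi;      /* 128-bit result value, high half                     */
          uint16_t error_flags;   /* 16-bit error flag value                             */
          uint16_t status_flags;  /* 16-bit current data-processing status flag          */
          uint16_t user_flags;    /* 16-bit user-defined application-specific flag       */
          uint16_t content_type;  /* 16-bit content-type/end-use flag (2x8 or 4x4 bits)  */
          uint64_t expiry_ps;     /* picoseconds from start-of-processing until expiry   */
      } result_entry_t;

      int main(void)
      {
          /* Maximum expiry: 2^64 - 1 picoseconds expressed in friendlier units. */
          double max_s = (double)UINT64_MAX * 1e-12;
          printf("max expiry ~ %.0f s = %.0f h = %.1f days\n",
                 max_s, max_s / 3600.0, max_s / 86400.0);
          return 0;   /* roughly the 18,446,744 s / 5124 h / ~213 days quoted above */
      }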

      The input-operands cache for each core is up to 65536 items deep (up to 128 bits wide) so that real-time convolution filters and small segments of very large arrays can be easily handled without having to go off-core into system heap memory or hard-drive space, and each operand ALSO HAS a 16-bit error flag, a 16-bit current status flag, a 16-bit content-type flag, a 16-bit operand how-to-process flag and a 64-bit user-defined application-specific usage, security or processing flag that can be sub-divided into multiple 8-bit or 16-bit sub-flags. Each operand flag indicates how, where, when, by whom and what is to be done with the operand. All major bitwise operations (and/or/xor/not/spin bits/shift left/shift right/set-bits-on/set-bits-off) and all major multiply/add/subtract/divide/root/power/modulo/int-portion-only/fractional-portion-only/boolean comparisons and value-specific search processing are built in and hardware accelerated in each core!
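
      And one input-operand entry, again purely illustrative, with the sub-dividable 64-bit user flag modelled as a union:

      #include <stdint.h>
      #include <stdio.h>

      /* One input-operand entry; the 65536-deep per-core operand cache described
         above would then come to about 2 MB per core. Names are guesses only.    */
      typedef struct {
          uint64_t operand_lo;        /* operand value, up to 128 bits wide        */
          uint64_t operand_hi;
          uint16_t error_flags;       /* 16-bit error flag                         */
          uint16_t status_flags;      /* 16-bit current status flag                */
          uint16_t content_type;      /* 16-bit content-type flag                  */
          uint16_t how_to_process;    /* 16-bit operand how-to-process flag        */
          union {                     /* 64-bit user-defined usage/security flag,  */
              uint64_t whole;         /* sub-dividable into 8- or 16-bit sub-flags */
              uint16_t sub16[4];
              uint8_t  sub8[8];
          } user_flags;
      } operand_entry_t;

      int main(void)
      {
          printf("entry: %zu bytes, full operand cache: %zu bytes\n",
                 sizeof(operand_entry_t), sizeof(operand_entry_t) * 65536u);
          return 0;   /* 32 bytes per entry, about 2 MB per core */
      }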

      We have found this method to have the best performance for real-time multimedia-centric and metadata processing-specific applications!

      P.S. The above multi-core CPU processing method is now fully free and open source worldwide under GPL-3 licence terms!

      V

      1. CowHorseFrog Silver badge

        I tried to keep my text simple, but as you expanded on it, there are a lot of supporting components needed to handle all the complexity of such a complex CPU core.

        I can't help but wonder: if things were simpler, you could get 10x or maybe 100x more simple cores, and those simple cores, even at half the clock speed, would get a shitload more work done.

        Just imagine how many 6502s you could get? Prolly close to a million 6502s at 1GHz would easily kill this monster, and it would use less power and fewer transistors, so it would be cheaper.
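
        Back of the envelope (treating the modern die's transistor budget as a guess, since Intel doesn't publish one for these parts; the 6502's roughly 3,500 transistors is the only solid number here):

        #include <stdio.h>

        int main(void)
        {
            const double transistors_6502  = 3500.0;  /* NMOS 6502, roughly 3,500 transistors        */
            const double server_die_budget = 50e9;    /* assumed ~50 billion for a modern server die */

            printf("~%.1f million 6502-sized cores by raw transistor count\n",
                   server_die_budget / transistors_6502 / 1e6);
            return 0;   /* ~14.3 million, ignoring the caches, interconnect and memory
                           controllers that actually eat most of a modern die         */
        }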

        1. StargateSg7 Bronze badge

          I actually remember somebody at DEC (Digital Equipment Corporation) of Maynard, Mass. who made supercomputers out of 6502 CPUs. I can't remember if it was a Connection Machine or a MasPar, but some employee of DEC did that by connecting one million 6502 CPUs to form a supercomputer. I think it may have been a sub-contracted part of the Intel-headed, RISC i860-powered Delta Touchstone massively parallel supercomputer systems which were sent to the NSA and the Department of Energy (i.e. Los Alamos). I think the USAF got one too, for designing the hull and engines of "The Green Lady" ZIP-fuel-powered (i.e. boranes) hypersonic secret recon aircraft.

          The Green Lady is STILL secret even though it was flying operationally by 1996/1997 and in initial testing by 1987/1988 under various hull designs. It's a Beautiful Aircraft! I have a few photos of the Green Lady on my wall, and it now flies out of Dryden and Diego Garcia after being disposed-of/off-loaded by the USAF/NRO back to the CIA for their own purposes. "Spearfisher" is the latest secret hypersonic replacement for the Green Lady, operational since 2015 (I think it was 2015 -- it could be 2017!). The latest iteration of the Pumpkin Seed TSTO spaceplane and the older SR-75 Valkyrie-like Carrier Craft + Parasite Spaceplane WERE ALSO designed on early massively parallel supercomputers using huge numbers of i860 CPUs and other parallel RISC-based supercomputers.

          The 6502 supercomputers were ALSO USED for simulating and creating early gene sequencing and gene editing technology that was done in upstate New York and in New Jersey in the 1980's/1990's as part of some US ARMY-initiated animal hybridization programs to create better/more-intelligent "Guard Dogs" and "Pack Animals" that were to be used for special forces personnel and for active base protection. There were some VERY INTERESTING and basically WEIRD results from those gene-editing/animal-hybridization programs!

          The U.S. Department of Defense was the largest purchaser of early massively parallel supercomputers that used COTS (Commercial Off-The-Shelf) CPUs and DSP chips in the 1990's!

          P.S. I remember the NSA/DOE 1980's/1990's supercomputer programs as Delta Touchstone and NOT Touchstone Delta! So There! That's another Mandela Effect at play here!

          V

  2. Neil Barnes Silver badge

    288 cores?

    "That's two gross!"

    "Nothing's too gross for this industry!"

    (Wish I could remember the film that supplied (almost) that quote... Porky's?)

  3. Anonymous Coward

    That many cores will cost for SW licensing

    I get the idea of e-cores, but for mainstream workloads I can see an issue.

    Until various software vendors charge a differential (lower) price for e-cores vs "full fat" ones, I can't imagine that the power savings from efficiency will actually yield a net saving for a customer.

    Can Intel make that happen? In the old "Wintel" days I think they likely could have, but now, probably not...

    1. Geoff Campbell Silver badge

      Re: That many cores will cost for SW licensing

      True, but there's a great deal that goes on in data centres using FOSS software to do heavily parallel tasks. Web and email services, as a simple example.

      GJC

  4. Zibob Silver badge

    Maybe

    I have heard this exact same story from Intel before, with the many-core Pentium 144-core parts that didn't actually end up being made.

    Sounds like they dusted off the old notes.

    Although it looks like their muddying of the waters worked. I cannot find the news about it now (it was maybe 10 years ago); now you only get news about the new one.

  5. Henry Wertz 1 Gold badge

    Ughh... e-cores

    I'm not impressed, at all, by these Atom-based cores. One thing worth noting -- I expect this "Oh, it's a bit faster than this 64-core Xeon" is using all 144 cores, not per-core performance. That'd be about in line with the very sluggish performance I've seen from the Atom-based parts over the years ("Celeron N" and so on).

    Intel has run into this problem before, and in fact it's where the Atom originated -- they had reasonably fast CPUs but wanted a lower-power part to compete with the ARMs. So they got one, but it was still somewhat higher power than the ARM while being far slower. Sounds like the same thing now -- they want to compete with the many-core ARMs (and AMD as well), so they now have this many-core CPU and are hoping people just overlook that each core is slow as all hell. It's not like those E-cores are completely useless -- having some lower-power cores to keep things ticking over under low load is great, as widely used in the ARM big.LITTLE setups. But I sure wouldn't want a CPU that is only E-cores.

    1. Bitsminer Silver badge

      Re: Ughh... e-cores

      There used to be an argument that Java-based server code used "efficiency" cores better than "performance" cores.

      The old SPARC, with many, many threads per core, was put forward as an example. Can't buy those today.

      I think the concern with 144 cores is that memory speed has only roughly doubled, to a few thousand megatransfers per second, while the channel count (with Intel) has also only doubled.

      So how do you feed all these cores from relatively low-bandwidth memory? Are they memory-starved?
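
      Back of the envelope, treating the platform figures as assumptions rather than gospel (call it 8 channels of DDR5-6400 on the 144-core parts):

      #include <stdio.h>

      int main(void)
      {
          const double channels    = 8.0;      /* assumed DDR5 channel count       */
          const double transfers_s = 6400e6;   /* assumed DDR5-6400 transfer rate  */
          const double bytes_xfer  = 8.0;      /* 64-bit data path per channel     */
          const int    cores       = 144;

          double peak_gbs = channels * transfers_s * bytes_xfer / 1e9;
          printf("peak %.0f GB/s total, ~%.1f GB/s per core\n",
                 peak_gbs, peak_gbs / cores);
          return 0;   /* roughly 410 GB/s total, under 3 GB/s per core */
      }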

      1. CowHorseFrog Silver badge

        Re: Ughh... e-cores

        Have different banks of memory for different CPU cores, rather than one shared global memory.

        It's the old separate-CPU-and-graphics-memory approach.
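
        Which is more or less what NUMA node binding already gives you on big boxes. A minimal sketch with libnuma (assuming Linux and the <numa.h> API; build with -lnuma):

        #include <numa.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            if (numa_available() < 0) {               /* no NUMA support on this machine  */
                puts("no NUMA here");
                return 1;
            }

            int    node = 0;                          /* the memory bank/node to stay on  */
            size_t len  = 64UL * 1024 * 1024;         /* 64 MB working set                */

            numa_run_on_node(node);                   /* keep this thread on that node    */
            void *buf = numa_alloc_onnode(len, node); /* and take memory from it too      */
            if (!buf)
                return 1;

            memset(buf, 0, len);                      /* touch it so the pages get placed */
            printf("allocated %zu bytes on node %d\n", len, node);

            numa_free(buf, len);
            return 0;
        }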

  6. Dadz

    Xeon Phi

    This reminds me of the Xeon Phi (2010-2020), which had up to 72 Silvermont-based cores on a Xeon-branded part.

    The Knights Landing version was heavily optimized for HPC, with out-of-order vector processing, but the rest of the core was similar to an Atom.

    https://chipsandcheese.com/2022/12/08/knights-landing-atom-with-avx-512/

    https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/

    The Gracemont core is similar to an original Intel Core processor.

    Note that the Knights Landing version had AVX-512 tacked on, which is not found in this new Sierra Forest chip.

    But Knights Landing was aimed at HPC, whereas Sierra Forest is aimed at cloud workloads.
