How long can it be before this approach gets extended with a memory controller for off-SIP memory so that the high bandwidth on-SIP memory is just another layer of cache? One year? Two? In the lab now?
How Apple's M1 uses high-bandwidth memory to run like the clappers
Apple last week set the cat among Intel's pigeons with the launch of its first PCs incorporating silicon designed in-house. The company claims its M1 Arm chip delivers up to 3.5x faster CPU performance, up to 6x faster GPU performance, up to 15x faster machine learning, and up to 2x longer battery life than previous-generation …
COMMENTS
-
-
Thursday 19th November 2020 13:35 GMT Malcolm 1
Isn't this what AMD are already doing with their "Infinity Cache" - currently only 128MB on their latest GPUs but you could easily see how that could be expanded (manufacturing tech permitting).
Several of their previous GPUs have featured HBM, but I gather it was prohibitively expensive and it now only appears on their datacenter/workstation products.
-
Thursday 19th November 2020 13:45 GMT Anonymous Coward
HBM is a DDR memory technology with the associated high latency, so it's not well suited to an L4 cache. But such an external memory controller could well be useful for Optane-type memory holding the operating system's page file.
Bandwidth and latency for an HBM2e pseudo-channel are roughly comparable with DDR5-5600 (32-bit wide). So you can think of an HBM stack as having 8x the performance of a DDR5 DIMM (though it has 16 effectively independent pseudo-channels). Currently HBM devices are available in up to 8GB capacity, but like all things DRAM that will inevitably increase with time.
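To make that "8x a DIMM" claim concrete, here's a back-of-the-envelope sketch in Python. The figures are the comment's round numbers (5600 MT/s, 16 pseudo-channels, 32-bit channel width), not datasheet values:

```python
# Rough bandwidth arithmetic for the comparison above.
# Figures are the ones quoted in the comment, not official specs.

DDR5_5600_DIMM_GBPS = 5600e6 * 8 / 1e9        # 5600 MT/s * 64-bit bus = 44.8 GB/s

HBM2E_CHANNELS = 16                           # pseudo-channels per stack
HBM2E_CHANNEL_GBPS = DDR5_5600_DIMM_GBPS / 2  # a 32-bit pseudo-channel is half a 64-bit DIMM

stack_gbps = HBM2E_CHANNELS * HBM2E_CHANNEL_GBPS
print(f"DDR5-5600 DIMM : {DDR5_5600_DIMM_GBPS:.1f} GB/s")
print(f"HBM2e stack    : {stack_gbps:.1f} GB/s "
      f"({stack_gbps / DDR5_5600_DIMM_GBPS:.0f}x a DIMM)")
```

16 channels at half a DIMM's width each is where the factor of 8 comes from.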
-
-
-
Thursday 19th November 2020 16:36 GMT TimMaher
Mac pro
My 2010 was a genuine “Trigger’s broom”. Third vid., second set of full RAM, better than Apple RAID controller driving 12TB across four drives in 0/1 configuration , hybrid system drive in second DVD slot.... etc.
Recently moved on up to a 2012 as I needed High Sierra support without using a VM.
-
Friday 20th November 2020 20:15 GMT schermer
And before that the MacPPC's (tower G5). I had one of those. ATM a MacPro 2009 fully upgraded (bought it new in 2009). Since I am pensioned it is just overkill. So I bought this new MacMini M1 (fully specced): it will be delivered next week and will be more than enough for my current needs.
-
-
-
-
Thursday 19th November 2020 16:14 GMT Charlie Clark
You haven't been able to do anything with the notebooks for more than five years now. It is annoying. However, apart from being able to run more and more VMs at once, RAM use on macOS has been reasonably constant for the last 10 years or so.
Who knows, maybe Apple will let people swap the SoC in a year or so from now. For a price, of course. But in the meantime there's no denying that they have a fairly compelling value proposition: improved performance and significantly improved battery life. That said, I certainly won't be switching to Big Sur until I know what the restrictions are and when they've really fixed all the bugs. I've skipped versions in the past when it was clear they were too buggy.
-
Friday 20th November 2020 10:08 GMT Dave 126
>Who knows, maybe Apple will let people swap the SoC in a year or so from now.
Mac users needing an upgrade tend to sell their existing machine and buy a new one. They don't normally depreciate that quickly (though we'll see what these M1 Macs do to the resale value of recent Intel Macs.)
Generally Apple want you to over-spec: it makes it easier to introduce new features to a larger pool of capable machines. For example, Apple's cheapest, thus baseline, machines now have graphics capabilities better than entry-level discrete GPUs. Soon, all Macs will have, at minimum, a fair bit of GPU prowess, and developers can work with that.
-
-
Friday 20th November 2020 14:16 GMT Wayland
It's actually a very justified reason for integrating the memory in the same device. I'll have some HBM with my CPU.
What I'd like to see from AMD is HBM in their CPUs with DDR used as virtual memory in some way. I'm pretty sure if you've got 16GB in the chip that will handle most things but with external DDR you won't run out.
-
-
Friday 20th November 2020 09:41 GMT Dave K
SGI machines never used a SiP design to limit expandability, however; all of them used memory modules and could have their RAM configuration modified at will. I presume you're referring more to the unified memory architecture used on the O2, and I believe on some of their Intel Visual Workstations. If so, there are similarities here, but the execution is quite different.
-
Thursday 19th November 2020 13:35 GMT Steve Davies 3
Unified Memory
Isn't that just a new name for 'Shared Memory'?
I'm sure that a lot of readers of this esteemed site remember that.
You know, when the graphics card used a great chunk of the CPU memory because the graphics card makers were cheapskates.
Things improved when graphics cards started to get really fast RAM.
At least Apple have made the memory bandwidth really, really large. IMHO, that's where a good deal of the performance comes from.
This CPU will give a few other chip designers a lot to think about.
But hey... Apple can't innovate can they? (sic)
-
Thursday 19th November 2020 13:51 GMT Anonymous Coward
Re: Unified Memory
The thing is, HBM stacks have 16 effectively independent data paths (each > 20 GB/s for HBM2e) so the system architect can assign some to the CPU and some to the GPU. It looks like that design has 2 stacks, so 32x 20GB/s should be enough to go round.
(By the way, yes there are applications that will saturate the bandwidth of 2 HBM stacks though you will find them running on GPGPUs for HPC).
Edit: by the way, something like this was done by Intel with the truly weird and wonderful Kaby Lake G - Intel CPU, AMD GPU, and HBM all in the same package. There is an important difference though: in that product the HBM was only attached to the GPU, and the CPU retained a traditional external memory interface.
-
Thursday 19th November 2020 14:25 GMT Anonymous Coward
Re: Unified Memory
Unified memory, I believe, creates the same addressing scheme across users, unlike shared memory, where a user can allocate from the pool but the allocation cannot be moved: it can only be copied and freed.
That is the difference between shared and unified memory. Unified is a superset of shared: all parties can theoretically see all the memory, and a handle alone can move data across cores/users.
This gives a significant performance advantage for offload and heterogeneous compute loads.
-
Thursday 19th November 2020 14:40 GMT Mark Honman
Re: Unified Memory
Hmm, that's a pretty convincing explanation of the unified vs shared mem difference. I've always struggled to understand the difference between OpenCL's USM and SVM, maybe this is the key...
If I get this right, the difference is that in a heterogeneous shared-memory scenario, the memory appears at different locations in each device's address space, so pointers are not translatable. For example, if the CPU builds a linked list in shared memory, the accelerator cannot just dereference the pointers.
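The linked-list problem described above can be shown with a toy simulation. This is pure illustration, not any real OpenCL or driver API; the base addresses are made up:

```python
# Toy model: the same shared buffer is mapped at different base addresses
# in a CPU's and an accelerator's address space. A raw pointer (absolute
# address) stored into the buffer by one side is meaningless to the other;
# only a base-relative offset works for both. Unified memory, by contrast,
# guarantees one common address space, so raw pointers survive.

CPU_BASE, GPU_BASE = 0x1000_0000, 0x8000_0000   # hypothetical mappings

def cpu_addr(offset): return CPU_BASE + offset
def gpu_addr(offset): return GPU_BASE + offset

# The CPU builds a "linked list" node at offset 64 whose next-pointer
# refers to offset 128, stored as an *absolute* CPU-side address:
next_ptr_stored_by_cpu = cpu_addr(128)

# The accelerator reads that value in its own address space:
wrong = next_ptr_stored_by_cpu                        # not mapped there!
right = gpu_addr(next_ptr_stored_by_cpu - CPU_BASE)   # translate via offset

print(hex(wrong), hex(right))   # only 'right' is valid on the accelerator
```

With unified memory the two bases would be identical, so the translation step disappears, which is exactly the convenience being discussed.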
-
-
-
Thursday 19th November 2020 14:20 GMT JDX
So what is the neural engine for?
Apple are making a big deal about ML capabilities on the M1 chippery but what use does this have in a commercial laptop used for gaming/music production/video editing/general use?
Is it used for everyday purposes I'm simply not aware of? I can't imagine Apple would bother with it for no reason but other than possibly Siri, I'm struggling to think why this is a core part of the system.
Anyone got a good answer?
-
Thursday 19th November 2020 14:30 GMT Anonymous Coward
Re: So what is the neural engine for?
Photo touch-up and editing algorithms use inferencing a lot. Edge detection, contours, and face detection come to mind. These can leverage the NPU, just as graphics leverages the GPU.
Others are for applications that customise per user - such as your usage patterns. These can use inferencing as well.
I'd imagine games could use it for some aspects of game play. This is a tiny subset of examples.
Just like graphics/GPU, the CPU could do it, but it is far more efficient and faster with an NPU.
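Edge detection, mentioned above, is a good example of why: it's a tiny arithmetic kernel repeated at every pixel with no branching, exactly the shape of work NPUs and GPUs accelerate. A minimal pure-Python sketch of a horizontal Sobel-style filter (illustrative only; a real NPU path would go through a framework such as Core ML):

```python
# A tiny horizontal-gradient (Sobel Gx) edge filter, written with plain
# loops to show the structure of the work: the same multiply-accumulate
# at every pixel -- ideal for fixed-function hardware, slow on a CPU.

GX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]

def sobel_x(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += GX[dy][dx] * img[y + dy - 1][x + dx - 1]
            out[y][x] = acc
    return out

# A vertical light/dark boundary produces a strong response at the edge:
img = [[0, 0, 9, 9]] * 4
print(sobel_x(img))
```

The point is not this particular filter but the workload shape: millions of identical multiply-accumulates, which is what the NPU exists to soak up.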
-
Friday 20th November 2020 14:16 GMT Wayland
Re: So what is the neural engine for?
Developers who use software- or GPU-based neural processing can take advantage of the hardware-based version. This conversion of software into hardware has been going on since the first microprocessor. Today's buzzwords, AI and Machine Learning, are fancy terms for a brute-force way of problem solving, and hardware helps a lot.
-
-
Tuesday 24th November 2020 15:21 GMT Michael Wojcik
Re: So what is the neural engine for?
There are various popular APIs. Apple has its own Core ML.
The ML hammer can be used on a wide variety of nails. Whether a given application is a wise use of ML technology is another question, of course, but it shouldn't be hard for application architects to find uses for that ANN core.
-
-
-
-
This post has been deleted by its author
-
-
Thursday 19th November 2020 16:00 GMT Anonymous Coward
Re: Performance tricks
"The SoC has access to 16GB of unified memory. This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design. A SoC is built from a single semiconductor die whereas a SiP connects two or more semiconductor dies."
The DRAM is not on the same chip. Or die.
-
-
-
Sunday 22nd November 2020 15:03 GMT StrangerHereMyself
Re: Performance tricks
Yes, I was wrong. From early descriptions I gathered that it was either located in the same package or on the same die.
That would've explained why the memory wasn't expandable. It seems they've opted for putting the DRAM *very* close to the SoC to increase bandwidth.
-
-
-
-
-
Thursday 19th November 2020 14:45 GMT Detective Emil
Sowing confusion
Apple really isn't helping by calling this high‑bandwidth, low‑latency memory, because, despite it being LPDDR4X, people are likely to confuse it with High Bandwidth Memory, a JEDEC standard. Indeed, a poor translation on Apple's Finnish store (since corrected) actually said "High Bandwidth Memory".
On the "unified memory is old hat" theme, yes indeed: there's a now-expired Apple patent concerning it from 1996.
-
Friday 20th November 2020 09:55 GMT Dave 126
Re: Sowing confusion
Yep, it's not HBM. Confirmed by Andrei over at Anandtech:
https://www.anandtech.com/comments/16226/apple-silicon-m1-a14-deep-dive/728721
However, what Apple have done is design a microarchitecture that retains and releases NSObjects (an operation used a lot by OSX applications) five times quicker than Intel chips can.
"Native code running on Apple Silicon is not 5 times faster than on Intel, generally, nor is Intel software running under Rosetta on Apple Silicon twice as fast as on Intel. But retaining and releasing NSObjects is so common on MacOS (and iOS), that making it 5 times faster on Apple Silicon than on Intel has profound implications on everything from performance to battery life."
https://daringfireball.net/2020/11/the_m1_macs
-
-
Thursday 19th November 2020 15:30 GMT Pascal Monett
So now Apple has made its entire RAM space shared
I seem to recall that not so long ago Intel had a big problem with its CPU architecture that could allow programs to access kernel memory, and many people were all in a tizzy about it.
Now, Apple brings a system-on-a-chip that shares all its memory space.
Is there no problem with that ?
-
Thursday 19th November 2020 19:36 GMT DS999
Re: So now Apple has made its entire RAM space shared
CPUs have always relied on memory protection to separate kernel and user pages. Unified memory used to be the norm, it only started getting separated because giving GPUs their own DRAM increases their performance since accessing via the PCIe bus is slow.
-
Thursday 19th November 2020 19:45 GMT Anonymous Coward
Re: So now Apple has made its entire RAM space shared
Not sure.
In principle, in a PC, a PCIe device that can bus-master can access any region of memory it wishes via DMA transfers, and the CPU is none the wiser. This was a problem with early Thunderbolt on the Mac (PCIe down a wire). We're quite content with that situation now that Thunderbolt has been improved, so there's probably not too much to worry about with this architecture. There's no real security difference I can see between devices being able to directly address memory and being able to freely DMA to/from it.
But I suppose the point is that, whilst Spectre was all about code running on the CPU being able to mess around with caches and learn about other memory content, now it's other things too (GPU code, the neural engine). Worse, the fact that an attempt at something like Spectre might be distributed between all three instead of just the CPU could make it very hard to spot in advance.
-
Friday 20th November 2020 14:17 GMT Wayland
Re: So now Apple has made its entire RAM space shared
If we compare running a task on a CPU to running it on a GPU, we see that the CPU has all sorts of fancy rings and layers protecting different processes from each other, whereas a GPU is just raw power in a parallel format.
I'm not sure how sophisticated an ARM is these days but it would need all those fancy x86 things in order to replace it on the desktop and in the server.
-
Friday 20th November 2020 17:25 GMT DS999
Re: So now Apple has made its entire RAM space shared
No it doesn't. x86 doesn't even use all those rings itself under Windows or Linux; it uses the standard memory protection capability which both x86 and ARM support. Don't believe the lie that ARM is only suitable for phones and can't run desktops or servers (it has been running in servers at places like Amazon for a couple of years now).
And nevermind that both iOS and Android are far more complex operating systems than any desktop or server was running not that long ago.
-
-
-
Thursday 19th November 2020 16:16 GMT Jason Hindle
Long term, I think we will see expansion options
With good memory management, perhaps we could expect little or no noticeable performance decrease where there is a mix of integrated and DIMM memory. That said, it is starting to look like what memory you have goes further on these new Macs (various demos of users maxing out the 8GB MacBooks; lots of vigorous debate around this).
The MacBook Air 2020 M1 sitting behind me (humble base model I've just bought)? Erm, runs like the clappers. Office under Rosetta? About 20 seconds when launching each application for the first time. Near as damnit instant thereafter (suggesting there is a one-off translation step). So, clappers for Office also. I'm a big believer in the capitalist principle that competition is good. This is that massive (and much needed) boot up Intel's arse.
-
Thursday 19th November 2020 16:41 GMT phuzz
Re: Long term, I think we will see expansion options
This is that massive (and much needed) boot up Intel's arse.
Especially with AMD firmly taking the high end performance crown from them with Zen3, this is probably not a fun time for Intel.
As you say competition is good. Last time Intel were getting clobbered by AMD (Athlon 64 vs Pentium IV), Intel went back to the drawing board and came back with the Core architecture, so hopefully they'll take this opportunity to do the same.
-
Thursday 19th November 2020 20:26 GMT O RLY
Re: Long term, I think we will see expansion options
Don't forget Intel was ALSO paying Dell $1 billion/year not to sell AMD-based systems at that time. Dell, the company, and Dell, the man, each paid small fines. Dell, the company, restated earnings for several years, their CEO resigned and several CFOs cycled through to clean up the mess. Hopefully Intel isn't cheating this time.
-
-
-
Thursday 19th November 2020 16:32 GMT martinusher
Great Block Diagram
I like the block diagram that illustrates the construction of this processor. Very informative. Apparently there are boxes connected through a box called 'fabric'. I shall remember this technique when I do my next design.
Reading between the lines it looks like the key design points are that the memory is physically close to the processing elements which allows the memory to be synchronized with the processor clocks. There's also no mention of (L1) cache, suggesting that the memory is effectively the cache. So at the price of some trickery in the memory controller design (maybe just alternating access between main processors and GPUs?) you get rid of all the overhead of managing cache misses.
(Shared memory is as old as the hills and then some. I got lumbered with a university project in the mid-1970s that rehabilitated a GPU that was designed to share memory with the main processor. I had to make it standalone and provide it with an interface to a different system. Logically straightforward, but very tedious, because the technology used was prehistoric (it was "transistorized", though). The GPU was from an old English Electric computer, and it was really very well designed for something that old.)
-
Thursday 19th November 2020 19:37 GMT Pascal Monett
Re: it was really very well designed for something that old
Of course it was really well designed. You're talking about an era when Boeing employed redundancy, when computer engineers were actual engineers and when said engineers knew what was going on electrically in their designs.
And they tested their designs before selling them.
-
Friday 20th November 2020 11:17 GMT Kristian Walsh
Re: Great Block Diagram
This design doesn't replace cache; it just adds a really low-overhead method of accessing the general system RAM pool.
The L1/L2 caches are still present as normal, although big cores have more than small ones, which confuses some benchmarking software. Total L1 seems to be 192k instruction+128k data, and Apple itself says there's 12 Mbyte L2 cache. Big cores have twice as much L2 cache as small ones, but it's unclear how L1 is allocated to each core.
-
-
Thursday 19th November 2020 19:35 GMT Anonymous Coward
Apple leading the way once more
I'm sure the haters will manage to pick holes but these stats simply annihilate the opposition.
M1 Arm chip delivers up to 3.5x faster CPU performance, up to 6x faster GPU performance, up to 15x faster machine learning, and up to 2x longer battery life than previous-generation Macs, which use Intel x86 CPUs.
Expect Windoze boxes to follow suit in short order.
-
Thursday 19th November 2020 19:40 GMT DS999
Re: Apple leading the way once more
Those stats (which are mostly hyperbolic anyway) have almost nothing to do with Apple putting LPDDR4x on package despite what this article claims. Heck, you can buy DDR4-4266 DIMMs, though they are not a JEDEC standard and used primarily by overclockers.
Given that Apple controls all their own hardware now, nothing would stop them from designing systems to use DDR4-4266 DIMMs and getting the exact same performance on a more "traditional" system. Actually you'd get BETTER performance, since LPDDR standards trade a bit of latency for power savings (Intel and AMD systems have lower memory latency than Apple's M1 Macs). That wasn't possible when they used Intel CPUs, since that clock rate is not officially supported, but Apple can officially support whatever they choose now.
-
Thursday 19th November 2020 20:02 GMT Nate Amsden
Re: Apple leading the way once more
Several folks seem to think this performance will be possible on Windows anytime soon, but MS partnered with Qualcomm for their ARM stuff and it seems weak by comparison. Qualcomm's ARM datacenter chips went nowhere as well. The trend of higher-performing mobile processors on Apple versus Android has been going on for a long time. While others make ARM chips for mobile, the general opinion seems to be that Qualcomm is by far the best/fastest when it comes to Android.
Things would be totally different if Apple had any history of licensing their chip designs, or even of agreeing to sell their chips to other companies, but they have no interest in doing so (and no signs of that changing). It's also not as if MS (or Google) can encourage Apple financially, given Apple has so much money in the bank.
Apple has certainly accomplished some amazing stuff by vertically integrating all of this, really good work. I'm certainly not their target market so won't be using this myself but for many people it will be good.
Will be interesting to see how this affects market share in these segments; I'm guessing Apple will pick up quite a bit versus Windows. Lots of folks touted OS X as a great, easy-to-use OS; add to that this new processor and the speed/battery savings it gives, and it's pretty amazing.
If anything, this obviously won't inspire significant fear in Qualcomm or other ARM vendors, because of Apple's locked-in ecosystem: they can't sell into iOS/OS X, and vice versa. Just look at the progress of processors in the wearable space for comparison. I have read that Apple has made quite a bit of progress there over the years, while many others either got out of the space or let their designs sit for years without improvement.
Since MS can't go to Apple to buy chips, they are sort of stuck. Same for Google. Sure MS or Google could design their own chips like Apple but it would take many years before they are viable like this (assuming they ever get to that point before being killed off).
-
Friday 20th November 2020 11:57 GMT Kristian Walsh
Re: Apple leading the way once more
That argument assumes that Apple has a runaway lead in SoC performance on Mobile, which is not true.
Apple's mobile SoCs tend to score very well on single-core benchmarks, but fall back into the pack when you look at multi-core scores. For example, Apple's A14 is the leading SoC on single-core benchmarks, but the Kirin 9000 beats it on multi-core tests, and early Snapdragon 875 results also show a significant lead in multi-core benchmarks over Apple.
Basically, Microsoft, Google, HP and Dell do have options if they want to pursue ARM for the desktop, and the performance boost of directly-attached RAM that Apple has used in the M1 is not a new idea and can be adapted to existing systems. It sacrifices any chance of upgrading RAM, though, which could limit its attractiveness in the Windows market, where enterprise IT procurement policies have a lot of power over what gets sold.
-
Friday 20th November 2020 17:28 GMT DS999
Re: Apple leading the way once more
The other SoCs only beat it in multi core because they have twice as many big cores. The situations where you are maxing out all cores on a phone are pretty rare, so those multithread scores are mostly for bragging rights. If Apple thought that mattered they'd put a couple more big cores in the iPhone SoCs, but they only do that in the iPad Pro because that's where it will actually matter.
-
Friday 20th November 2020 18:57 GMT Wayland
Re: Apple leading the way once more
If the SoC is socketed then a RAM upgrade could be like upgrading the CPU. Virtual Memory has been around for decades so if the external DDR was presented as a RAM drive then it could be used as super fast swap space. A slight change in the architecture but the OS won't notice.
-
-
Friday 20th November 2020 14:29 GMT Anonymous Coward
Re: Apple leading the way once more
"M1 Arm chip delivers up to 3.5x faster CPU performance, up to 6x faster GPU performance, up to 15x faster machine learning, and up to 2x longer battery life than previous-generation Macs, which use Intel x86 CPUs."
3.5x faster than a three-year-old Intel CPU that was neither the fastest nor the most power-efficient option.
Given Intel's current CPU production woes, it is likely to have been the right choice by Apple, but I would still expect a bumpy ride over the next year for ARM users as usability/performance issues are ironed out.
On the Windows side (or other non-OSX OS), users are less constrained by Apple's CPU choices, so the performance advantages are less apparent.
-
-
Friday 20th November 2020 10:03 GMT Dave 126
It's not HBM!
This idea that the M1 uses HBM came from a mistranslation on Apple's Finnish website. This has been confirmed by Anandtech.
What Apple have done is design a microarchitecture that is very fast at doing common operations in the macOS programming framework, according to John Gruber. Like, 6.5 nanoseconds, compared to 30 nanoseconds on Intel.
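Per-operation costs at that scale (6.5 ns vs 30 ns) are typically measured by timing millions of iterations and dividing. A generic sketch of the technique, using Python's own reference-counted objects as a stand-in (this measures Python interpreter overhead, not Objective-C retain/release on any particular chip):

```python
import timeit

# Time a tight loop of object create/destroy (which exercises reference
# counting), then divide by the iteration count to get nanoseconds per
# operation. Demonstrates the measurement method only; the absolute
# numbers here say nothing about Apple Silicon vs Intel.

N = 1_000_000
total_s = timeit.timeit("x = object(); del x", number=N)
print(f"{total_s / N * 1e9:.1f} ns per create/destroy")
```

The same loop-and-divide approach is how the published retain/release figures would have been obtained.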
-
Friday 20th November 2020 10:50 GMT Dave 126
>So what is the neural engine for?
If no machines have a neural engine, no devs will bother developing for it. Apple have the view that if they include a new, unused type of hardware in a popular machine without the user specifically choosing it, devs will develop for it and the end user will see the benefit.
This has worked for Apple before. Most Mac users didn't use FireWire, but its inclusion in all Macs opened the door to ideas like the iPod (the Mk I was FireWire because USB 1 wasn't fast enough for it).
-
Friday 20th November 2020 15:51 GMT ThomH
Right; it's exposed to all developers as Core ML, and for now it is slowly creeping into image, video and audio editors; Pixelmator Pro jumps on it for image processing, for example. It's not clear to me that there's more here than you'd get with a modern dedicated GPU on any other computer, though, so you're probably just looking at Apple optimising to do the task as well as can be done within the confines of a mobile SoC.
I'd be completely out of my depth trying to say anything beyond that.
-
Friday 20th November 2020 12:47 GMT Dominic Sweetman
At a few GHz, CPUs are memory-limited. On-chip caches let you make the CPU faster, until cache misses dominate the workload. A 2GHz wide-issue CPU with cunning insides can probably perform 3-4 instructions per nanosecond. In a classic PC, a DRAM access must go off chip, off module, through a connector, across the tracks and through a DRAM interface. That's going to take perhaps 80-100 nanoseconds, could be longer, and the more expandable the memory is, the longer it will take. That represents about 250-400 instructions you didn't execute because you were waiting for memory. Big caches lower cache-miss rates, but only to the low single digits of a percent. Fast CPUs running one thread (most laptops have one impatient user waiting for one thing to happen) spend the great majority of their time waiting for memory.
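The arithmetic above, made explicit. The inputs are the comment's own round numbers; the 2% miss rate and one memory access per three instructions are illustrative assumptions, not measurements:

```python
# How many instructions a DRAM miss costs, and how quickly stalls come
# to dominate. Round numbers from the comment; miss rate and access
# density are assumed for illustration.

ipns = 3.5          # instructions per nanosecond (wide-issue ~2 GHz core)
miss_ns = 90.0      # off-package DRAM round trip

lost_per_miss = ipns * miss_ns   # instructions not executed per miss
print(f"~{lost_per_miss:.0f} instructions lost per miss")

# Assume a miss on 2% of memory accesses, one access per 3 instructions:
miss_rate = 0.02
accesses_per_instr = 1 / 3
stall_ns_per_instr = miss_rate * accesses_per_instr * miss_ns
compute_ns_per_instr = 1 / ipns
frac = stall_ns_per_instr / (stall_ns_per_instr + compute_ns_per_instr)
print(f"fraction of time stalled on memory: {frac:.0%}")
```

Even at a low-single-digit miss rate, the core spends most of its time waiting, which is the comment's central point.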
Meanwhile, Moore's law continues to work for memory density. A laptop with 16GB of memory sounds pretty usable, and should get you big gains in performance and battery use. A trade-off well worth making.
-
Friday 20th November 2020 18:36 GMT Disk0
I for one
welcome our artificially intelligent SoCs.
Progress is a beautiful thing. This old ad comes to mind: <https://tinyurl.com/y2cwjryr> ...Now featuring butterfly wings!...
And I very much like that this is a significantly different architecture.
We've been stuck with the x86 monoculture for too long.
I am also looking forward to some form of multicore high-performance Raspberry Pi to power laptops - it can't be long now.
Let's see who can make the lightest, most efficient and most performant architecture. We will all win, and I want them all.