Why Is Apple's M1 Chip So Fast?

Real world experience with the new M1 Macs has started ticking in. They are fast. Real fast. But why? What is the magic?

Erik Engheim · 2 days ago

On YouTube I watched a Mac user who had bought an iMac last year. It was maxed out with 40 GB of RAM, costing him about $4000. He watched in disbelief as his hyper expensive iMac was demolished by his new M1 Mac Mini, which he had paid a measly $700 for.

In real world test after test, the M1 Macs are not merely inching past top of the line Intel Macs, they are destroying them. In disbelief, people have started asking how on earth this is possible.

If you are one of those people, you have come to the right place. Here I plan to break down into digestible pieces exactly what it is that Apple has done with the M1. Specifically, the questions I think a lot of people have are:

1. What are the technical reasons this M1 chip is so fast?
2. Has Apple made some really exotic technical choices to make this possible?
3. How easy will it be for the competition, such as Intel and AMD, to pull the same technical tricks?

Sure, you could try to Google this, but if you try to learn what Apple has done beyond the superficial explanations, you will quickly get buried in highly technical jargon, such as the M1 using very wide instruction decoders, an enormous re-order buffer (ROB) etc. Unless you are a CPU hardware geek, a lot of this will simply be gobbledegook.

To get the most out of this story I advise reading my earlier piece: "What Does RISC and CISC Mean in 2020?"
There I explain what a microprocessor (CPU) is, as well as various important concepts such as:

- Instruction Set Architecture (ISA)
- Pipelining
- Load/Store Architecture
- Microcode vs Micro-operations

But if you are impatient, I will do a quick version of the material you need to understand to grasp my explanation of the M1 chip.

What is a Microprocessor (CPU)?

Normally when speaking of chips from Intel and AMD we talk about central processing units (CPUs) or microprocessors. As you can read more about in my RISC vs CISC story, these pull in instructions from memory. Then each instruction is typically carried out in sequence.

[Figure: A very basic RISC CPU, not the M1. Instructions are moved from memory along blue arrows into the instruction register. There a decoder figures out what the instruction is and enables different parts of the CPU through the red control lines. The ALU adds and subtracts numbers placed in the registers.]

A CPU at its most basic level is a device with a number of named memory cells called registers and a number of computational units called arithmetic logic units (ALUs). The ALUs perform things like addition, subtraction and other basic math operations. However, these are only connected to the CPU registers. If you want to add up two numbers, you have to get those two numbers from memory and into two registers in the CPU.

Here are some examples of typical instructions that a RISC CPU as found on the M1 carries out.

    load r1, 150
    load r2, 200
    add r1, r2
    store r1, 310

Here r1 and r2 are the registers I talked about. Modern RISC CPUs cannot do operations on numbers which are not in a register like this. E.g. they cannot add two numbers residing in RAM in two different locations. Instead each of these two numbers has to be pulled into a separate register. That is what we do in this simple example. We pull in the number at memory location 150 in the RAM and put it into register r1 in the CPU. Next we put the contents of address 200 into register r2. Only then can the numbers be added with the add r1, r2 instruction. The final store instruction writes the contents of r1 back to memory location 310.

[Figure: An old mechanical calculator with two registers: the accumulator and input register. Modern CPUs typically have more than a dozen registers, and they are electronic rather than mechanical.]

The concept of registers is old. E.g. on this old mechanical calculator, the register is what holds the numbers you are adding. This is likely the origin of the term cash register. The register is where you registered input numbers.

The M1 is not a CPU!

But here is a very important thing to understand about the M1:

The M1 is not a CPU, it is a whole system of multiple chips put into one large silicon package. The CPU is just one of these chips.

Basically the M1 is one whole computer on a chip. The M1 contains a CPU, a Graphics Processing Unit (GPU), memory, input and output controllers and many more things making up a whole computer. This is what we call a System on a Chip (SoC).

[Figure: The M1 is a System on a Chip, meaning all the parts making up a computer are placed on one silicon chip.]

Today if you buy a chip, whether from Intel or AMD, you actually get what amounts to multiple microprocessors in one package. In the past computers would have multiple physically separate chips on the motherboard of the computer.

[Figure: Example of a computer motherboard. Memory, CPU, graphics cards, IO controllers, network card and many other components can be attached to the motherboard to communicate with each other.]

However, because we are able to put so many transistors on a silicon die today, companies such as Intel and AMD began putting multiple microprocessors onto one chip. Today we refer to these chips as CPU cores. One core is basically a fully independent chip which can read instructions from memory and perform calculations.

[Figure: A microchip with multiple CPU cores.]

This has for a long time been the name of the game in terms of increasing performance: just add more general purpose CPU cores. But there is a disturbance in the force.
There is one player in the CPU market which is deviating from this trend.

Apple's Not So Secret Heterogeneous Computing Strategy

Instead of adding ever more general purpose CPU cores, Apple has followed another strategy: they have started adding ever more specialized chips doing a few specialized tasks. The benefit of this is that specialized chips tend to be able to perform their tasks significantly faster, using much less electric current than a general purpose CPU core.

This is not entirely new knowledge. For many years already, specialized chips such as graphics processing units (GPUs) have been sitting in Nvidia and AMD graphics cards, performing operations related to graphics much faster than general purpose CPUs.

What Apple has done is simply to take a more radical shift in this direction. Rather than just having general purpose cores and memory, the M1 contains a wide variety of specialized chips:

- Central Processing Unit (CPU): The "brains" of the SoC. Runs most of the code of the operating system and your apps.
- Graphics Processing Unit (GPU): Handles graphics-related tasks, such as visualizing an app's user interface and 2D/3D gaming.
- Image Signal Processor (ISP): Can be used to speed up common tasks done by image processing applications.
- Digital Signal Processor (DSP): Handles more mathematically intensive functions than a CPU. Includes decompressing music files.
- Neural Processing Unit (NPU): Used in high-end smartphones to accelerate machine learning (AI) tasks. These include voice recognition and camera processing.
- Video encoder/decoder: Handles the power-efficient conversion of video files and formats.
- Secure Enclave: Encryption, authentication and security.
- Unified memory: Allows the CPU, GPU and other cores to quickly exchange information.

This is part of the reason why a lot of people working on images and video editing with the M1 Macs are seeing such speed improvements. A lot of the tasks they do can run directly on specialized hardware.
That is what allows a cheap M1 Mac Mini to encode a large video file without breaking a sweat, while an expensive iMac has all its fans going full blast and still cannot keep up.

[Figure: In blue you see multiple CPU cores accessing memory, and in green you see large numbers of GPU cores accessing memory.]

Unified memory may confuse you. How is it different from shared memory? And wasn't sharing video memory with main memory a terrible idea in the past, giving low performance? Yes, shared memory was indeed bad. The reason was that the CPU and GPU had to take turns accessing the memory. Sharing it meant contention for the databus. Basically the GPUs and CPUs had to take turns using a narrow pipe to push or pull data through.

That is not the case with unified memory. With unified memory the GPU cores and CPU cores can access memory at the same time. Thus in this case there is no overhead in sharing memory. In addition, the CPU and GPU can tell each other where some memory is located. Previously the CPU would have to copy data from its area of the main memory to the area used by the GPU. With unified memory, it is more like saying "Hey Mr. GPU, I got 30 MB of polygon data starting at memory location 2430." The GPU can then start using that memory without doing any copying.

That means you can get significant performance gains from the fact that all the various special co-processors on the M1 can rapidly exchange information with each other by using the same memory pool.

[Figure: How Macs used GPUs before unified memory. There was even an option of having graphics cards outside the computer using a Thunderbolt 3 cable. There is some speculation that this may still be possible in the future.]

Why Don't Intel and AMD Copy This Strategy?

If what Apple is doing is so smart, why isn't everybody doing it? To some extent they are. Other ARM chip makers are increasingly putting in specialized hardware.
AMD has also started putting stronger GPUs on some of their chips and is moving gradually towards some form of SoC with the accelerated processing units (APUs), which are basically CPU cores and GPU cores placed on the same silicon die.

[Figure: AMD Ryzen Accelerated Processing Unit (APU), which combines CPU and GPU (Radeon Vega) on one silicon chip. It does not, however, contain other co-processors, IO controllers or unified memory.]

Yet there are important reasons why they cannot do this. An SoC is naturally something the computer maker, such as Dell or HP, would make, since an SoC is essentially a whole computer on a chip. This works fine for ARM, because a company such as Dell or HP would simply license ARM intellectual property and buy various IP for other chips, possibly from ARM, to add whatever specialized hardware they think their SoC should have. Then they ship the specs over to a semiconductor foundry such as GlobalFoundries or TSMC, which manufactures chips for AMD and Apple today.

[Figure: TSMC semiconductor foundry in Taiwan. TSMC manufactures chips for other companies such as AMD, Apple, Nvidia and Qualcomm.]

Here we get a big problem with the Intel and AMD business model. Their business models are based on selling general purpose CPUs, which people just slot into a large PC motherboard. Thus computer makers can simply buy motherboards, memory, CPUs and graphics cards from different vendors and integrate them into one solution.

But we are quickly moving away from that world. In the new SoC world you don't assemble physical components from different vendors. Instead you assemble IP (intellectual property) from different vendors. You buy the designs for graphics cards, CPUs, modems, IO controllers and other things from different vendors and use those to design an SoC in-house. Then you get a foundry to manufacture it.

Now you have a big problem, because none of Intel, AMD or Nvidia are going to license their intellectual property to Dell or HP for them to make an SoC for their machines.

Sure, Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC makers may have different ideas of what they should contain. You potentially get a conflict between Intel, AMD, Microsoft and PC makers about what sort of specialized chips should be included, because these will need software support.

For Apple this is simple. They control the whole widget. They give you e.g. the Core ML library for developers to write machine learning stuff. Whether Core ML runs on Apple's CPU or the Neural Engine is an implementation detail developers don't have to care about.

The Fundamental Challenge of Making Any CPU Run Fast

So heterogeneous computing is part of the reason, but not the sole reason. The fast general purpose CPU cores on the M1, called Firestorm, are genuinely fast. This is a major deviation from ARM CPU cores of the past, which tended to be very weak compared to AMD and Intel cores.

Firestorm in contrast beats most Intel cores and almost beats the fastest AMD Ryzen cores. Conventional wisdom said that was not going to happen.

Before talking about what makes Firestorm fast, it helps to understand what the core idea of making a fast CPU is really about. In principle you accomplish it with a combination of two strategies:

1. Perform more instructions in a sequence faster.
2. Perform lots of instructions in parallel.

Back in the 80s, it was easy. Just increase the clock frequency and the instructions would finish faster. Every clock cycle is when the computer does something. But this something can be quite little. Thus an instruction may require multiple clock cycles to finish, because it is made up of several smaller tasks.

However, today increasing the clock frequency is next to impossible. That is the whole "End of Moore's Law" that people have been harping on for over a decade now. Thus it is really about executing as many instructions as possible in parallel.

Multi-Core or Out-of-Order Processors?

There are two approaches to this.
One is to add more CPU cores. From the point of view of a software developer it is like adding threads. Every CPU core is like a hardware thread. If you don't know what a thread is, then you can think of it as the process of carrying out a task. With two cores, a CPU can carry out two separate tasks concurrently: two threads. The tasks could be described as two separate programs stored in memory, or it could actually be the same program performed twice. Each thread needs some book-keeping, such as where in a sequence of program instructions the thread currently is. Each thread may store temporary results which should be kept separate.

In principle a processor can have just one core and run multiple threads. In this case it simply halts one thread and stores its current progress before switching to another. Later it switches back. This doesn't bring much of a performance enhancement and is only used when a thread may frequently halt to wait for input from the user, data from a slow network connection etc. These may be called software threads. Hardware threads means you have actual extra physical hardware, such as extra cores, at your disposal to speed things up.

The problem with this is that the developer has to write code to take advantage of it. Some tasks, such as server software, are easy to write like this. You can imagine processing each connecting user separately. These tasks are so independent of each other that having lots of cores is an excellent choice for servers, especially cloud based services.

[Figure: The Ampere Altra Max ARM CPU with 128 cores, designed for cloud computing, where a lot of hardware threads is a benefit.]

That is the reason why you see ARM CPU makers such as Ampere making CPUs such as the Altra Max, which has a crazy 128 cores. This chip is specifically made for the cloud. You don't need crazy single core performance, because in the cloud it is all about having as many threads as possible per watt to handle as many concurrent users as possible.
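
To make the idea of processing each connecting user separately a bit more concrete, here is a minimal Python sketch. The names handle_user and serve, and the arithmetic standing in for real work, are made up for illustration and are not taken from any real server. The sketch only shows the shape of code that can be spread across many workers because the tasks do not depend on each other.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def handle_user(user_id: int) -> str:
    """Hypothetical request handler: each user is served independently."""
    total = sum(range(user_id * 1000))  # stand-in for real per-user work
    return f"user {user_id}: {total}"


def serve(user_ids, workers=None):
    """Hand each independent user request to a pool of worker threads."""
    # Default to one worker per logical core reported by the OS.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order tasks finish in.
        return list(pool.map(handle_user, user_ids))


if __name__ == "__main__":
    for line in serve(range(1, 5)):
        print(line)
```

Note that in standard CPython builds the global interpreter lock keeps these threads from running Python code on several cores at the same moment, so treat this as a picture of how the work is divided rather than as a benchmark. The point is that nothing in handle_user needs to know about any other user, which is what makes such workloads a good fit for chips with lots of cores.
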

Apple, in contrast, is at the complete opposite end of the spectrum. Apple makes single user devices. Lots of threads is not an advantage. Their devices are used for gaming, video editing, development etc. They want desktops with beautiful, responsive graphics and animations.

Desktop software is generally not made to utilize lots of cores. E.g. a computer game will likely benefit from 8 cores, but something like 128 cores would be a total waste. Instead you would want fewer but more powerful cores.

So here is the interesting thing: Out-of-Order execution is a way to execute more instructions in parallel, but without exposing that capability as multiple threads. Developers don't have to code their software specifically to take advantage of it. Seen from the developer's perspective, it just looks like each core runs faster.

To understand how this works, you need to understand some things about memory. Asking for data in one particular memory location is slow. But there is no difference in delay between getting 1 byte and getting, say, 128 bytes. Data is sent across what we call a databus. You can think of it as a road or pipe between memory and different parts of the CPU where data gets pushed through. In reality it is of course just some copper tracks conducting electricity. If the databus is wide enough, you can get multiple bytes at the same time.

Thus CPUs get a whole chunk of instructions at a time to execute. But they are written to be executed one after the other. Modern microprocessors do what we call Out-of-Order (OoO) execution.

That means they are able to analyze a buffer of instructions quickly and see which ones depend on which. Look at the simple example below:

    01: mul r1, r2, r3 // r1 ← r2 × r3
    02: add r4, r1, 5 // r4 ← r1 + 5
    03: add r6, r2, 1 // r6 ← r2 + 1

Multiplication tends to be a slow process. So say it takes multiple clock cycles to perform. The second instruction will simply have to wait, because its calculation depends on knowing the result that gets put into the r1 register.

However, the third instruction at line 03 doesn't depend on calculations from previous instructions. Hence an Out-of-Order processor can begin calculating this instruction in parallel.

More realistically, however, we are talking about hundreds of instructions. The CPU is able to figure out all the dependencies between these instructions.

It analyzes the instructions by looking at the inputs to each instruction. Do the inputs depend on output from one or more other instructions? By input and output we mean registers containing results from previous calculations.

E.g. the add r4, r1, 5 instruction depends on input from r1, which is produced by mul r1, r2, r3. We can chain together these relationships into long elaborate graphs which the CPU can work through. The nodes are the instructions and the edges are the registers connecting them.

The CPU can analyze such a graph of nodes and determine which instructions it can perform in parallel and where it needs to wait for the results from multiple dependent calculations before carrying on.

Many instructions will finish early, but we cannot make their results official. We cannot commit them, otherwise we supply the results in the wrong order. To the rest of the world it has to look as if the instructions were carried out in the same sequence as they were issued.

Like a queue, the CPU will keep popping finished instructions from the front, until hitting an instruction which is not done.

We are not quite done with this explanation, but this gives you a bit of a clue. Basically you can have parallelism that the programmer must know about, or the kind which the CPU fakes to look as if everything is single threaded. However, behind the scenes it is doing Out-of-Order black magic.

It is the superior Out-of-Order execution which is making the Firestorm cores on the M1 kick ass and take names. It is in fact much stronger than anything from Intel or AMD.
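
The dependency analysis described in this section can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the Instr type and the function names are invented for illustration, only "reads a register that an earlier instruction wrote" is tracked, and things real hardware deals with (instruction latencies, register renaming, in-order commit) are left out. It uses the same three instructions as the mul/add example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str     # the instruction as written in the example
    dest: str     # register written
    srcs: tuple   # registers read (immediate values are ignored)


PROGRAM = [
    Instr("mul r1, r2, r3", "r1", ("r2", "r3")),
    Instr("add r4, r1, 5", "r4", ("r1",)),
    Instr("add r6, r2, 1", "r6", ("r2",)),
]


def dependencies(program):
    """For each instruction, list the earlier instructions whose result it reads."""
    last_writer = {}  # register -> index of the latest instruction writing it
    deps = []
    for i, ins in enumerate(program):
        deps.append(sorted({last_writer[r] for r in ins.srcs if r in last_writer}))
        last_writer[ins.dest] = i
    return deps


def parallel_waves(program):
    """Group instructions into waves; instructions in one wave are independent."""
    deps = dependencies(program)
    wave_of = []
    for i in range(len(program)):
        # An instruction runs one wave after the latest wave it depends on.
        wave_of.append(1 + max((wave_of[d] for d in deps[i]), default=-1))
    waves = [[] for _ in range(max(wave_of, default=-1) + 1)]
    for i, w in enumerate(wave_of):
        waves[w].append(program[i].text)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(PROGRAM)):
        print(number, wave)
```

In this sketch the mul and the second add land in the first wave, because neither reads a register written by an earlier instruction, while add r4, r1, 5 has to wait for the mul. A real core does this continuously in hardware over a much larger window of instructions.
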
It is likely stronger than that of anybody else in the mainstream market.

Why is AMD and Intel Out-of-Order Execution Inferior to M1?

In my explanation of Out-of-Order execution (OoO) I skipped some important details which need to be covered. Otherwise it is not possible to understand why Apple is ahead of the game and why Intel and AMD may not be able to catch up.

The big "scratchpad" I talked about, the buffer of instructions being analyzed, is actually called the Re-Order Buffer (ROB), and it doesn't contain normal machine code instructions. Not the ones that the CPU fetches from memory to execute. Those are the instructions in the CPU Instruction Set Architecture (ISA). That is the kind of instructions we call x86, ARM, PowerPC etc.

However, internally the CPU works on an entirely different instruction set, invisible to the programmer. We call these micro-operations (micro-ops or μops). The ROB is full of these micro-ops.

These are much more practical to work with for all the magic a CPU does to make stuff run in parallel. The reason is that micro-ops are very wide (contain a lot of bits) and can contain all sorts of meta-information. You cannot add that kind of information to an ARM or x86 instruction as it would:

1. Totally bloat the program binaries.
2. Expose details about how the CPU works, whether it has an OoO unit, has register renaming and many other details.
3. Make little sense, since a lot of the meta-information is only meaningful in the context of the current execution.

You can think of this as when writing a program. You have a public API that needs to be stable and that everybody uses. That is the ARM, x86, PowerPC, MIPS etc. instruction sets. The micro-ops are basically the private APIs that are used to implement the public ones.

Also, micro-ops are usually easier for the CPU to work with. Why? Because they each do one simple, limited task. Regular ISA instructions can be more complex, causing a bunch of stuff to happen, and thus actually translate to multiple micro-ops.
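
As a rough illustration of one ISA instruction turning into several micro-ops, here is a small Python sketch. Everything in it is an assumption made for the example: the two instruction forms, the MicroOp fields and the tmp0 scratch register are invented, and real micro-op formats are proprietary and far richer. It only shows the idea that a complex instruction is chopped into simple steps, while a simple one maps to a single step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    op: str      # one simple operation: "load", "add" or "store"
    dest: str    # register or memory address written
    srcs: tuple  # registers or memory addresses read


def decode(instruction: str) -> list:
    """Split one made-up ISA instruction into one or more micro-ops."""
    mnemonic, operands = instruction.split(maxsplit=1)
    a, b = [part.strip() for part in operands.split(",")]
    if mnemonic == "add" and a.startswith("["):
        # CISC-style "add [addr], reg": read memory, add, write memory back.
        addr = a.strip("[]")
        return [
            MicroOp("load", "tmp0", (addr,)),
            MicroOp("add", "tmp0", ("tmp0", b)),
            MicroOp("store", addr, ("tmp0",)),
        ]
    if mnemonic == "add":
        # RISC-style register-to-register add maps to a single micro-op.
        return [MicroOp("add", a, (a, b))]
    raise ValueError(f"unknown instruction: {instruction}")


if __name__ == "__main__":
    for text in ("add r1, r2", "add [150], r2"):
        print(text, "->", [m.op for m in decode(text)])
```

The memory-operand form expands into three micro-ops and the register form into one, which is the sense in which micro-ops each do one simple, limited task.
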

For CISC CPUs there is usually no alternative but to use micro-ops, otherwise the large complex CISC instructions would make pipelines and OoO next to impossible to achieve. RISC CPUs have a choice. So e.g. smaller ARM CPUs don't use micro-ops at all. But that also means they cannot do things such as OoO.

But you may wonder: why does any of this matter? Why is this detail important to know to understand why Apple has the upper hand on AMD and Intel?

It is because the ability to run fast depends on how quickly you can fill up the ROB with micro-ops, and with how many. The more quickly you fill it up and the larger it is, the more opportunities you are given to pick instructions you can execute in parallel and thus improve performance.

Machine code instructions are chopped into micro-ops by what we call an instruction decoder. If we have more decoders we can chop up more instructions in parallel and thus fill up the ROB faster.

And this is where we see the huge differences. The biggest, baddest Intel and AMD microprocessor cores have 4 decoders, which means they can decode 4 instructions in parallel, spitting out micro-ops.

But Apple has a crazy 8 decoders. Not only that, but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

Why Can't Intel and AMD Add More Instruction Decoders?

This is where we finally see the revenge of RISC, and where the fact that the M1 Firestorm core has an ARM RISC architecture begins to matter.

You see, for x86 an instruction can be anywhere from 1 to 15 bytes long. On a RISC chip instructions are fixed size. Why is that relevant in this case?

Because splitting up a stream of bytes into instructions to feed into 8 different decoders in parallel becomes trivial if every instruction has the same length.

However, on an x86 CPU the decoders have no clue where the next instruction starts. They have to actually analyze each instruction in order to see how long it is.
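
The difference between the two cases can be sketched in Python. The fixed-width function assumes 4 byte instructions, as on 64-bit ARM. The variable-length encoding is a toy I made up for the example (the low 4 bits of the first byte give the instruction length); real x86 length decoding is far more involved. Both function names are hypothetical.

```python
def split_fixed(code: bytes, width: int = 4) -> list:
    """Fixed-width ISA: every boundary is known up front.

    Each slice can be computed without looking at any other slice,
    so the work could be handed to many decoders at once.
    """
    return [code[i:i + width] for i in range(0, len(code), width)]


def split_variable(code: bytes) -> list:
    """Toy variable-length ISA: low 4 bits of the first byte give the length.

    The start of instruction n+1 is unknown until instruction n has been
    inspected, so this loop is inherently sequential.
    """
    out, pos = [], 0
    while pos < len(code):
        length = code[pos] & 0x0F
        if length == 0 or pos + length > len(code):
            raise ValueError(f"bad instruction at offset {pos}")
        out.append(code[pos:pos + length])
        pos += length
    return out


if __name__ == "__main__":
    print(split_fixed(bytes(range(8))))
    print(split_variable(bytes([0x01, 0x03, 0xAA, 0xBB, 0x02, 0xCC])))
```

In split_fixed the boundaries fall out of simple arithmetic. In split_variable each step depends on the previous one, which is the kind of serial dependency that makes it hard to keep many decoders busy.
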

The brute force way Intel and AMD deal with this is by simply attempting to decode instructions at every possible starting point. That means we have to deal with lots of wrong guesses and mistakes which have to be discarded. This creates such a convoluted and complicated decoder stage that it is really hard to add more decoders. But for Apple it is trivial in comparison to keep adding more.

In fact, adding more causes so many other problems that 4 decoders, according to AMD itself, is basically an upper limit for how far they can go.

This is what allows the M1 Firestorm cores to essentially process twice as many instructions as AMD and Intel CPUs at the same clock frequency.

One could argue as a counterpoint that CISC instructions turn into more micro-ops, that they are denser, so that e.g. decoding one x86 instruction is more similar to decoding, say, two ARM instructions.

Except this is not the case in the real world. Highly optimized x86 code rarely uses the complex CISC instructions. In some regards it has a RISC flavor.

But that doesn't help Intel or AMD, because even if those 15 byte long instructions are rare, the decoders have to be made to handle them. This incurs complexity which blocks AMD and Intel from adding more decoders.

But AMD's Zen3 Cores Are Still Faster, Right?

As far as I remember from performance benchmarks, the newest AMD CPU cores, the ones called Zen3, are slightly faster than Firestorm cores. But here is the kicker: that only happens because the Zen3 cores are clocked at 5 GHz. Firestorm cores are clocked at 3.2 GHz. The Zen3 is just barely squeezing past Firestorm despite having a roughly 56% higher clock frequency (5 GHz versus 3.2 GHz).

So why doesn't Apple increase the clock frequency too? Because a higher clock frequency makes the chips hotter. Running cool is one of Apple's key selling points. Their computers, unlike Intel and AMD offerings, barely need cooling.

In essence one could say Firestorm cores really are superior to Zen3 cores.
Zen3 only manages to stay in the game by drawing a lot more current and getting a lot hotter. That is something Apple simply chooses not to do.

If Apple wants higher performance they are simply going to add more cores. That lets them keep watt usage down while offering more performance.

The Future

It seems AMD and Intel have painted themselves into a corner on two fronts:

- They don't have a business model which makes it easy to pursue heterogeneous computing and SoC designs.
- Their legacy x86 CISC instruction set is coming back to haunt them, making it hard to improve OoO performance.

It doesn't mean game over. They can of course simply clock up more, use more cooling, throw in more cores, beef up the CPU caches etc. But they are both at a disadvantage. Intel is in the worst situation, as their cores are already soundly beaten by Firestorm, and they have weak GPUs to integrate in an SoC solution.

The problem with throwing in more cores is that for typical desktop workloads you reach diminishing returns with too many cores. Sure, lots of cores are great for servers. However, here companies such as Amazon and Ampere are attacking with monster CPUs with 128 cores.

This is like fighting on the western and eastern fronts at the same time.

But fortunately for AMD and Intel, Apple doesn't sell their chips on the market. So PC users will simply have to put up with whatever they are offering. PC users may jump ship, but that is a slow process. You don't immediately leave a platform you are heavily invested in.

But young professionals, with money to burn and without too deep investments in any platform, may increasingly turn to Apple in the future, beefing up Apple's hold on the premium market and consequently its share of the total profit in the PC market.