New High End CPU Arm Cortex-A77 – faster even without Moore's Law

At Arm's TechDay, the company's chief architect introduced the new Cortex-A77 application processor. Although the basic microarchitecture was adopted from the Cortex-A76, processing power increased by 20% on the identical manufacturing process and at the same clock frequency. How is this possible?

As mentioned several times before, the development centers for Cortex-A are located in Austin, Texas, at Arm's headquarters in Cambridge, and in beautiful Sophia-Antipolis near Nice, where Texas Instruments designed its OMAP processors ten years ago. The Cortex-A73 and -A75 generations both came from France, while the energy-saving Cortex-A53 and -A55 came from Cambridge. The last Austin processor was the Cortex-A76, which is used in smartphone SoCs and, in its AE version, in automotive SoCs.

Since the A76's chief architect, Mike Filippo, is now responsible for the Neoverse infrastructure CPUs, the Cortex-A77 project, code-named "Deimos", was led by his Austin-based colleague Chris Abernathy (Fig. 1).

While the A76 implemented a completely new microarchitecture after four years of work, the Cortex-A77 builds on the Cortex-A76. This is not surprising: Filippo had already announced in 2018 that the A76 would be the basis for at least two more CPU generations, and the Cortex-A77 is the first of them.

Compared to its predecessor, the Cortex-A75, the Cortex-A76 achieved a 35% increase in integer performance in the well-known SPECint2006 benchmark, which Intel also accepts. However, the Cortex-A75 could only be clocked at 2.8 GHz in a 10 nm process, while the Cortex-A76 targeted a 7 nm process and a clock frequency of 3 GHz. The Cortex-A77 now achieves a further 23% increase in integer performance in the same 7 nm process at the same clock frequency, a gain that comes exclusively from improvements to the microarchitecture. Floating-point performance increased even more, by 35% (SPECfp2006) and 25% (SPECfp2017), while energy efficiency is similar to that of the Cortex-A76. All values were measured on one core, i.e. they reflect single-thread performance. The silicon area is about 17% larger than that of the Cortex-A76; the reasons for this follow below.
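As a rough back-of-the-envelope check of these figures, the following sketch compounds the two generational integer gains and relates the A77 gain to its area increase. The function and variable names are illustrative only, and multiplying the two gains assumes they are directly comparable, which is only approximately true because the A75-to-A76 step also included a process and clock change.

```python
# Figures quoted in the text (integer benchmark, single core).
A76_OVER_A75 = 1.35       # includes the move from 10 nm / 2.8 GHz to 7 nm / 3 GHz
A77_OVER_A76 = 1.23       # same 7 nm process, same clock: microarchitecture only
A77_AREA_OVER_A76 = 1.17  # silicon area relative to the Cortex-A76


def compound_gain(*gains):
    """Multiply successive relative gains into one overall factor."""
    total = 1.0
    for gain in gains:
        total *= gain
    return total


def perf_per_area(perf_gain, area_gain):
    """Relative performance per unit of silicon area."""
    return perf_gain / area_gain


a77_over_a75 = compound_gain(A76_OVER_A75, A77_OVER_A76)
density = perf_per_area(A77_OVER_A76, A77_AREA_OVER_A76)

print(f"A77 vs. A75 integer performance: {a77_over_a75:.2f}x")
print(f"A77 vs. A76 performance per area: {density:.2f}x")
```

Under these assumptions, performance per area still improves slightly, because the 23% gain outweighs the 17% area increase.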

The L1 caches for instructions and data are 64 KB each, and the L2 cache per core can be configured as 256 KB or 512 KB. A shared L3 cache of up to 4 MB can be added. Together with a "small" Cortex-A55, the Cortex-A77 can be operated in a DynamIQ cluster.

The microarchitecture at a glance

The superscalar out-of-order CPU now comes with six instruction decoders instead of four and ten execution units instead of eight (Figure 2). In the front end, Arm again uses its combined branch prediction and instruction fetch unit, called predict-directed fetch because the branch prediction feeds directly into the instruction fetch unit. This approach results in higher throughput and lower power consumption.

The branch prediction uses a hybrid indirect predictor. The predictor is decoupled from the fetch unit, and its most important structures operate independently of the rest of the CPU. It is now backed by a two-level hierarchy of branch target buffers (BTBs) instead of the three-level hierarchy of the A76: an L1 BTB with one clock cycle of latency that now comprises 64 instead of 16 entries (on the Cortex-A76 this was still called the nanoBTB), and a main BTB that now comprises 8000 instead of 6000 entries. The 64-entry microBTB that formed the second level on the Cortex-A76 could logically be omitted thanks to the enlarged L1 BTB.
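The two-level lookup can be pictured with a toy model. The sketch below is not Arm's design: it assumes fully associative tables with LRU replacement, and the two-cycle latency of the main BTB is a made-up placeholder, since the text only gives the one-cycle latency of the L1 BTB. The class and method names are hypothetical.

```python
from collections import OrderedDict


class LruTable:
    """Fixed-capacity table with least-recently-used replacement (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, pc):
        if pc in self.entries:
            self.entries.move_to_end(pc)  # mark as most recently used
            return self.entries[pc]
        return None

    def insert(self, pc, target):
        self.entries[pc] = target
        self.entries.move_to_end(pc)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


class TwoLevelBtb:
    L1_ENTRIES = 64      # from the text
    MAIN_ENTRIES = 8000  # from the text
    L1_LATENCY = 1       # cycles, from the text
    MAIN_LATENCY = 2     # cycles, ASSUMED for illustration only

    def __init__(self):
        self.l1 = LruTable(self.L1_ENTRIES)
        self.main = LruTable(self.MAIN_ENTRIES)

    def predict(self, pc):
        """Return (predicted target or None, lookup latency in cycles)."""
        target = self.l1.lookup(pc)
        if target is not None:
            return target, self.L1_LATENCY
        target = self.main.lookup(pc)
        if target is not None:
            self.l1.insert(pc, target)  # promote into the fast level
            return target, self.MAIN_LATENCY
        return None, self.MAIN_LATENCY  # no prediction available

    def update(self, pc, target):
        """Record a resolved branch in both levels."""
        self.main.insert(pc, target)
        self.l1.insert(pc, target)


btb = TwoLevelBtb()
for i in range(100):  # more branches than the L1 BTB can hold
    btb.update(0x1000 + 4 * i, 0x8000 + 4 * i)
print(btb.predict(0x1000))  # oldest branch: served by the main BTB
print(btb.predict(0x1000))  # now promoted: served by the L1 BTB
```

The point of the model is only that a larger first level catches more branches at the shortest latency, which is why an intermediate level becomes less valuable.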

The hit rate for correctly predicted branches is said to have increased again compared to the Cortex-A76. In addition, the bandwidth of the branch prediction was doubled to 64 bytes per clock cycle, which means that the prefetcher can fetch up to 16 instructions per clock cycle (Arm instructions are 32 bits wide, Thumb instructions only 16 bits). This enables earlier prefetching on L1 cache misses and prevents the dreaded bubbles (pipeline stalls) on taken branches where fetch had assumed that no branch would be made.
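The relationship between the fetch window and the instruction count is simple arithmetic, sketched here. The function name is illustrative, and the result is an upper bound that ignores alignment and branches inside the window.

```python
FETCH_BYTES_PER_CYCLE = 64  # branch prediction bandwidth quoted in the text


def instructions_per_window(bytes_per_cycle, instruction_bits):
    """Upper bound on the instructions covered by one fetch window."""
    return bytes_per_cycle * 8 // instruction_bits


arm_32bit = instructions_per_window(FETCH_BYTES_PER_CYCLE, 32)
thumb_16bit = instructions_per_window(FETCH_BYTES_PER_CYCLE, 16)
print(f"32-bit Arm instructions per window: {arm_32bit}")
print(f"16-bit Thumb instructions per window: {thumb_16bit}")
```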

A novelty is the newly introduced macro-op cache (MOP cache), an instruction cache with 1.5 K entries (approx. 50 KB) for decoded instructions, the so-called macro-ops, which is reminiscent of similar implementations in Intel and AMD x86 CPUs. On a cache hit, the rename stage in the pipeline is fed directly from the MOP cache, so it acts as a kind of L0 cache; according to Arm, its hit rate is more than 85%. In the best case, the branch misprediction penalty is only 10 clock cycles instead of 11 as on the Cortex-A76, a sensational value that is more than 30% below the 16 clock cycles of Intel's Skylake CPUs or of Samsung's Armv8-based M3 CPU. The MOP cache occupies about half the area of the 64 KB L1 cache.
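To put the cycle counts in proportion, the sketch below computes the relative savings and a hit-rate-weighted average. It assumes that a miss in the macro-op cache costs the same 11 cycles as on the Cortex-A76, which the text does not state, and it uses Arm's "more than 85%" hit rate as a lower bound. All names are illustrative.

```python
PENALTY_A77_BEST = 10  # cycles, best case quoted for the Cortex-A77
PENALTY_A76 = 11       # cycles, Cortex-A76
PENALTY_SKYLAKE = 16   # cycles, quoted for Intel Skylake and Samsung M3
MOP_HIT_RATE = 0.85    # "more than 85%" per Arm, used here as a lower bound


def relative_saving(new_cycles, old_cycles):
    """Fraction by which new_cycles undercuts old_cycles."""
    return 1.0 - new_cycles / old_cycles


def expected_cycles(hit_rate, hit_cycles, miss_cycles):
    """Hit-rate-weighted average; miss_cycles is an ASSUMPTION (A76 value)."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


print(f"A77 best case vs. Skylake: {relative_saving(PENALTY_A77_BEST, PENALTY_SKYLAKE):.1%} fewer cycles")
print(f"A76 vs. Skylake: {relative_saving(PENALTY_A76, PENALTY_SKYLAKE):.1%} fewer cycles")
average = expected_cycles(MOP_HIT_RATE, PENALTY_A77_BEST, PENALTY_A76)
print(f"Weighted average at 85% hit rate: {average:.2f} cycles")
```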

The branch unit can process eight 16-bit instructions per clock cycle, which land in a fetch queue ahead of instruction fetch. This queue holds twelve blocks. The fetch unit itself works at only half that throughput, so a maximum of four 16-bit instructions are fetched per clock cycle. In the event of a mispredicted branch, this arrangement can hide the misprediction from the rest of the pipeline without stalling the fetch unit and the rest of the CPU.

The decode and register rename blocks can process six instructions per clock cycle, wider than the A76 with four and much wider than the Cortex-A75 with three and the Cortex-A73 with two. A port-sharing mechanism has been implemented to prevent an excessive increase in power consumption.

The decoders output macro-ops, on average 1.06 macro-ops per original instruction according to Arm. Register renaming for integer, ASIMD and flag operations is done in separate units that are clock-gated when not needed, which saves a substantial amount of energy. While decoding on the Cortex-A77 takes two clock cycles, register renaming takes only one. The macro-ops are then expanded into micro-ops at a ratio of 1.2 micro-ops per instruction. At the end of the stage, up to 10 micro-ops per clock cycle are output, an increase of 25%, 67% and 150% compared to the Cortex-A76, -A75 and -A73, respectively (8 micro-ops per clock cycle for the Cortex-A76, 6 for the Cortex-A75 and 4 for the Cortex-A73).
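The percentages follow directly from the per-core micro-op widths given in parentheses, as this short calculation shows. The dictionary and function names are illustrative.

```python
# Maximum micro-ops per clock cycle, as quoted in the text.
MICRO_OPS_PER_CYCLE = {
    "Cortex-A77": 10,
    "Cortex-A76": 8,
    "Cortex-A75": 6,
    "Cortex-A73": 4,
}


def percent_increase(new, old):
    """Relative increase of new over old, rounded to whole percent."""
    return round((new - old) / old * 100)


a77_width = MICRO_OPS_PER_CYCLE["Cortex-A77"]
increases = {
    core: percent_increase(a77_width, width)
    for core, width in MICRO_OPS_PER_CYCLE.items()
    if core != "Cortex-A77"
}
for core, pct in increases.items():
    print(f"Cortex-A77 vs. {core}: +{pct}%")
```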

The Cortex-A77's commit buffer now holds 160 instead of 128 entries. The buffer is divided into two structures, one for instruction management and one for register recovery; Arm calls this a hybrid commit system. This allows even greater code parallelism.

The execution units

As before, the integer part contains two load/store units, but it now has two pipelines instead of one for branches, which makes sense in view of the doubled branch prediction bandwidth in the front end.

There are now three ALUs instead of two that can perform simple arithmetic operations in one clock cycle and more complex operations such as logic, test/compare or shift operations in two clock cycles. As on the A76, there is also one complex pipeline for multiplication, division and CRC operations.

The latency of integer multiplications in the complex ALU has been reduced from 4 to 2-3 clock cycles. Taken together, this leads to a 50% increase in throughput in the integer part. The part responsible for floating-point and vector (ASIMD) operations contains two pipelines, as on the A76; a second pipeline for AES encryption was added.

Loading and storing data

The data cache is fixed at 64 KB and is 4-way set associative. Its latency remains at four cycles. The 64 KB L1 instruction cache reads up to 32 bytes per clock cycle, as does the L1 data cache in both directions. The L1 is a write-back cache. The L2 cache is configurable as 256 KB or 512 KB and has the same 2 × 32 bytes per clock cycle read and write interfaces to the DSU L3 cache on the data side. A window up to 25% larger for in-flight load and store operations (68/72 on the Cortex-A76) provides even more memory-level parallelism. The bandwidth of load and store operations was doubled by giving the two store-data pipelines dedicated execution ports, which were previously shared with the ALUs. As Figure 2 shows, two µOps can be processed in parallel for address generation and two for store data. The instruction queues have been unified to maintain energy efficiency.
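These interface widths and window sizes translate into concrete numbers with a little arithmetic. The sketch below assumes the 3 GHz target clock mentioned earlier in the article and reads "68/72" as in-flight loads and stores on the Cortex-A76; the resulting Cortex-A77 window sizes are an upper bound derived from "up to 25% larger", not figures quoted in the text. All names are illustrative.

```python
CLOCK_HZ = 3.0e9           # ASSUMED: the 3 GHz target clock mentioned in the article
BYTES_PER_CYCLE = 32       # per direction on the L1 data interface
A76_LOADS_IN_FLIGHT = 68   # interpretation of "68/72"
A76_STORES_IN_FLIGHT = 72
WINDOW_GROWTH = 1.25       # "up to 25% larger"


def peak_bandwidth_gb_s(bytes_per_cycle, clock_hz):
    """Theoretical peak bandwidth of one direction in GB/s (10^9 bytes/s)."""
    return bytes_per_cycle * clock_hz / 1e9


def scaled_window(entries, growth):
    """Upper bound on window entries after applying the growth factor."""
    return int(entries * growth)


l1_peak = peak_bandwidth_gb_s(BYTES_PER_CYCLE, CLOCK_HZ)
a77_loads = scaled_window(A76_LOADS_IN_FLIGHT, WINDOW_GROWTH)
a77_stores = scaled_window(A76_STORES_IN_FLIGHT, WINDOW_GROWTH)

print(f"Peak L1 bandwidth per direction: {l1_peak:.0f} GB/s")
print(f"In-flight loads/stores, upper bound: {a77_loads}/{a77_stores}")
```

Sustained bandwidth in practice will be lower than this peak, since it depends on hit rates and on how well the access pattern fills every cycle.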

The Cortex-A77 received a data prefetcher with new engines for higher accuracy, optimized once more compared to the Cortex-A76. It can also cover longer maximum address distances, which allows higher bandwidth utilization in the DRAM. This means that the prefetchers can recognize repetitive access patterns across larger virtual address ranges than before.

System-based prefetching means improved tolerance to different memory subsystem implementations, dynamic spacing for different latencies, and dynamically adjusted aggressiveness based on DynamIQ L3 utilization.

In general, Arm promises an increase in IPC (instructions per clock cycle) of 23% for integer instructions and 35% for floating-point/vector operations. The latter is surprising, as Arm has left the floating-point pipelines in the back end unchanged from the Cortex-A76. The only explanation I can think of is that SPECfp is much more memory intensive than its integer counterpart, so this is where the memory system improvements pay off.

All in all, the Cortex-A77 is expected to deliver 20% higher Geekbench 4 performance and also a 20% increase in LMbench (average of reading and writing) (Figure 3).