New RISC-V processor family

From microcontroller to multicore processor

November 10, 2021, 11:30 a.m. | By Frank Riemenschneider, Segger Microcontroller

Continued from Part 1 of the article

New CPU in Series 7

The new dual-issue Series 7 CPU represents a departure from SiFive's previous CPUs. Series 5 uses a simple five-stage scalar pipeline; implemented in TSMC's 28 nm process, it achieves a clock frequency of up to 1.5 GHz. The S54 includes the RV64I base ISA as well as the Multiply and Divide (M), Atomic (A) and Compressed (C) extensions. Optionally, the Series 5 processor supports the single-precision (F) and double-precision (D) floating-point extensions.

Figure 3. The 8-stage dual-issue pipeline of the SiFive S7 core.
© Segger Microcontroller

As Figure 3 shows, the Series 7 processor expands the pipeline to eight stages and adds several execution units for superscalar operations. The first execution slot performs memory operations (load/store) and simple integer operations, whereas the second slot performs arbitrary integer operations – including multiply/divide – branch resolution, and floating-point operations. SiFive has added a second fetch stage and a second data memory access stage to allow for larger L1 cache and scratchpad memories. A second decode stage handles superscalar dispatching.
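The slot assignment described above can be sketched as a small pairing check. This is a simplified model, not SiFive's dispatch logic: the instruction class names are hypothetical labels, and data dependencies and other hazards are ignored.

```python
# Simplified model of the two execution slots described in the text.
# The class names are hypothetical labels, not SiFive terminology.
SLOT_0 = {"load", "store", "alu"}           # memory and simple integer operations
SLOT_1 = {"alu", "muldiv", "branch", "fp"}  # any integer op, branches, floating point


def can_dual_issue(first: str, second: str) -> bool:
    """Return True if the two instruction classes fit into different slots.

    Only the slot capabilities are checked; data dependencies between the
    two instructions are deliberately ignored in this sketch.
    """
    return (first in SLOT_0 and second in SLOT_1) or (
        first in SLOT_1 and second in SLOT_0
    )


if __name__ == "__main__":
    pairs = [("load", "fp"), ("alu", "alu"), ("load", "store"), ("muldiv", "branch")]
    for pair in pairs:
        print(pair, can_dual_issue(*pair))
```

In this model a load can pair with a floating-point instruction, whereas two memory operations or a multiply and a branch compete for the same slot and must issue in separate cycles.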

Both execution slots contain arithmetic logic units (ALUs) in the fifth stage, which handle most arithmetic instructions. Branch resolution can use these ALUs immediately, resulting in a penalty of five clock cycles when a branch is mispredicted. However, when an ALU instruction needs the output of a pending load, it moves to stage seven, which contains a second set of ALUs. These "late" ALUs provide a load-to-use latency of zero cycles, meaning that a dependent ALU instruction can be processed in the cycle immediately following the instruction that loads its data. When a branch is resolved by the late ALUs, the misprediction penalty increases to seven clock cycles.
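To get a feel for the two penalties, the following sketch averages them over a workload. The five- and seven-cycle figures come from the text; the `late_fraction` parameter (the share of mispredicted branches resolved by the late ALUs) is a hypothetical workload property, not a figure from SiFive.

```python
EARLY_ALU_PENALTY = 5  # cycles, branch resolved by the stage-5 ALUs
LATE_ALU_PENALTY = 7   # cycles, branch resolved by the stage-7 ALUs


def average_mispredict_penalty(late_fraction: float) -> float:
    """Average misprediction penalty in cycles.

    late_fraction is the (hypothetical) share of mispredicted branches that
    wait on a pending load and are therefore resolved by the late ALUs.
    """
    if not 0.0 <= late_fraction <= 1.0:
        raise ValueError("late_fraction must be between 0 and 1")
    return (1.0 - late_fraction) * EARLY_ALU_PENALTY + late_fraction * LATE_ALU_PENALTY


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 1.0):
        print(fraction, average_mispredict_penalty(fraction))
```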

Figure 4. Microarchitecture of SiFive's S-CPU.
© Segger Microcontroller

The biggest change in Series 7 is the revised memory subsystem, which adds a data cache and optional tightly integrated memory (TIM). The FIO port bypasses the core complex bus. Figure 4 shows the structure of the CPU.

The U74MC has a 64-bit register set and 64-bit data path, L1 instruction and data caches protected by ECC, a physical memory protection (PMP) unit, and a memory management unit (MMU) that enables the use of Linux. The MMU implements the 39-bit version (Sv39) of the RISC-V virtual memory system. The PMP protects up to eight memory regions and allows permissions to be assigned for user-mode accesses. The processor core can also contain a core-local interrupt controller (CLIC) to enable interrupt prioritization and preemption. To prevent side-channel attacks, system software can clear the branch history when switching processes.
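As an illustration of what the 39-bit scheme means in practice, the sketch below splits a virtual address the way the RISC-V privileged specification defines Sv39: three 9-bit virtual page numbers and a 12-bit page offset, with bits 63 to 39 required to equal bit 38. The function name and return format are illustrative choices, not part of any SiFive software.

```python
PAGE_OFFSET_BITS = 12  # 4 KiB base pages
VPN_BITS = 9           # each of the three table levels indexes 512 entries


def split_sv39(vaddr: int) -> dict:
    """Split a 64-bit Sv39 virtual address into VPN[2], VPN[1], VPN[0] and offset."""
    if not 0 <= vaddr < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    # Bits 63..39 must be copies of bit 38, so bits 63..38 are all 0 or all 1.
    upper = vaddr >> 38
    if upper not in (0, (1 << 26) - 1):
        raise ValueError("address is not a canonical Sv39 address")
    mask = (1 << VPN_BITS) - 1
    vpn = [(vaddr >> (PAGE_OFFSET_BITS + VPN_BITS * level)) & mask for level in range(3)]
    return {
        "vpn2": vpn[2],
        "vpn1": vpn[1],
        "vpn0": vpn[0],
        "offset": vaddr & ((1 << PAGE_OFFSET_BITS) - 1),
    }


if __name__ == "__main__":
    example = (1 << 30) | (2 << 21) | (3 << 12) | 0x45
    print(hex(example), split_sv39(example))
```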

The 32-bit E76 and 64-bit S76 are microcontroller-class CPUs that, unlike the U74, lack an MMU but include optional tightly integrated memory (TIM) and the FIO port. SiFive configures the E7x cores with a 64-KB four-way set-associative instruction cache, an instruction TIM that can be accessed in a single cycle, or both.

For data, either a cache or a TIM can be selected. Although the data TIM ranges from 4 KB to 256 KB, most developers opt for 32 KB. For real-time processors, developers can use the instruction TIM and disable dynamic branch prediction at boot time. These processor cores typically run a real-time operating system (RTOS) and small applications, so complex cache structures are not required.

Vector unit for SiFive's S7

VIS7, unveiled in early 2021, is a processor capable of 64 billion FP32 operations per second and designed for deterministic operation. It combines the S7 processor with a 512-bit wide vector unit.

The new VIS7 processor includes the vector extension RVV 1.0. The vector unit works with 8-, 16- and 32-bit data in floating-point, fixed-point and integer formats. It uses a 512-bit vector ALU and a 512-bit vector memory unit.
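The element counts per operation and the figure of 64 billion FP32 operations per second follow from simple arithmetic, sketched below. Two assumptions are made here: a fused multiply-add is counted as two operations per lane and cycle, and the clock is taken as 2.0 GHz, the peak frequency quoted in this article.

```python
VECTOR_WIDTH_BITS = 512  # width of the vector ALU stated in the article


def lanes(element_bits: int) -> int:
    """Number of elements a 512-bit vector operation processes at once."""
    if element_bits <= 0 or VECTOR_WIDTH_BITS % element_bits != 0:
        raise ValueError("element width must divide the vector width")
    return VECTOR_WIDTH_BITS // element_bits


def peak_ops_per_second(element_bits: int, clock_hz: int, ops_per_lane: int = 2) -> int:
    """Peak throughput; ops_per_lane=2 assumes a fused multiply-add counts twice."""
    return lanes(element_bits) * ops_per_lane * clock_hz


if __name__ == "__main__":
    for width in (8, 16, 32):
        print(f"{width}-bit elements: {lanes(width)} lanes")
    print("FP32 peak:", peak_ops_per_second(32, 2_000_000_000), "operations per second")
```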

The VIS7 can be compared to the Cortex-R82, Arm's first 64-bit real-time processor. To increase the SIMD performance of the R82, licensees can integrate an optional 128-bit wide Neon unit. The VIS7 and R82 have eight cores and offer real-time determinism to deliver predictable throughput. Both processors use tightly integrated or tightly coupled memory to reduce memory transaction times and improve determinism.

SiFive's VIS7 processor achieves 5.1 CoreMarks/MHz, 12% behind Arm's R82, and it operates at a similar peak frequency of 2.0 GHz. However, it shines in SIMD operations, as its vector unit is 4× wider and almost quadruples the FP32 peak throughput of the R82's Neon unit. Programmers can also set LMUL (length multiplier), a control setting for grouping vector registers, to 2, 4 or even 8, creating an 8,192-bit wide virtual register in extreme cases. LMUL does not improve peak throughput, but it does reduce the number of instructions needed to supply the vector unit.
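The effect of LMUL on instruction count can be sketched as follows. The vector register length (`vlen_bits`) is a parameter because the article states only the 512-bit width of the ALU and memory unit; the 1,024-bit value in the example is a hypothetical choice for illustration. Only the integer LMUL settings mentioned in the text are modeled.

```python
import math


def elements_per_instruction(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements covered by one vector instruction on a register group.

    vlen_bits: length of a single vector register (assumption, see lead-in)
    sew_bits:  element width in bits
    lmul:      number of registers grouped together (1, 2, 4 or 8)
    """
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch models only LMUL values of 1, 2, 4 and 8")
    return vlen_bits * lmul // sew_bits


def instructions_needed(n_elements: int, vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Vector instructions required to touch n_elements once."""
    return math.ceil(n_elements / elements_per_instruction(vlen_bits, sew_bits, lmul))


if __name__ == "__main__":
    VLEN = 1024  # hypothetical register length in bits
    for lmul in (1, 2, 4, 8):
        print(f"LMUL={lmul}: {instructions_needed(4096, VLEN, 32, lmul)} instructions")
```

With these assumed values, processing 4,096 FP32 elements takes 128 instructions at LMUL = 1 and 16 at LMUL = 8. The datapath does the same amount of work either way, which is why peak throughput is unchanged.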

The core complex includes a fully coherent, shared memory area. A platform-level interrupt controller (PLIC) distributes global interrupts. Each processor core can be configured individually: for example, one with SRAM, another with an accelerator, and a third without either. All processor cores are connected to a cache-coherent bus and can see and access the FIO port of every other core, which means they can also access the SRAM and any user-defined accelerator attached to the other cores.

For a simple microcontroller, a single E76 core with TIM, FIO and CLIC (core-local interrupt controller) can be used instead of a multi-core configuration, and the L2 cache and PLIC block can be omitted. Excluding memory, such a microcontroller occupies 0.112 mm² of silicon area in TSMC's 28HPC process when using a standard 9-track cell library. According to SiFive, it consumes 20.4 mW, again without memory, when running the Dhrystone benchmark at a clock frequency of 400 MHz. For maximum performance, a 12-track library should allow worst-case operation at 875 MHz, with the processor core occupying 0.174 mm² and consuming 74.4 mW.
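The two implementation points can be compared with a short calculation. The figures are those quoted above; the sketch assumes that the 74.4 mW value applies at the 875 MHz clock, which the text implies but does not state explicitly. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    clock_mhz: float
    power_mw: float
    area_mm2: float

    def mw_per_mhz(self) -> float:
        """Power normalized to clock frequency."""
        return self.power_mw / self.clock_mhz


NINE_TRACK = OperatingPoint("9-track library", 400, 20.4, 0.112)
TWELVE_TRACK = OperatingPoint("12-track library", 875, 74.4, 0.174)

if __name__ == "__main__":
    for point in (NINE_TRACK, TWELVE_TRACK):
        print(f"{point.name}: {point.mw_per_mhz():.3f} mW/MHz, {point.area_mm2} mm2")
    print(f"area ratio: {TWELVE_TRACK.area_mm2 / NINE_TRACK.area_mm2:.2f}")
```

Under that assumption, the faster implementation needs roughly 0.085 mW/MHz against 0.051 mW/MHz and about 1.55 times the area, the usual price of a higher clock target.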


