RISC-V 和嵌入式系统¶

RISC-V是重塑芯片行业的开源指令集架构。该文件涵盖了 RISC-V 理念、V vector extension、嵌入式 ML inference、微控制器上的 TinyML、AI 加速器中的 RISC-V 以及边缘部署限制

到目前为止，我们介绍的每个芯片架构（x86、ARM）都需要许可证。 Intel 和 AMD 支付 x86 的费用。苹果、高通和每个智能手机供应商每年支付 ARM 数十亿美元。 RISC-V 不同：它是一个开放标准。任何人都可以设计、制造和销售 RISC-V 芯片，而无需向任何人支付专利费。这正在改变芯片设计的经济性，尤其是人工智能。

RISC-V 理念¶

RISC-V（发音为“risk Five”）于 2010 年在加州大学伯克利分校创建，是一个干净、现代的 RISC 指令集。关键原则：
- 开放标准：ISA 规范可免费获得。您可以构建 RISC-V CPU，无需许可费用、保密协议或法律协议。这就像 Linux 之于操作系统一样——任何人都可以使用、修改和构建它。
- 模block化设计：基础 ISA（RV32I 或 RV64I）非常小 — 只有 47 条指令。其他一切都是可选的 extensions：M（乘法/除法）、A（原子运算）、F/D（浮点）、C（压缩指令）、V（vector 处理）。您只选择您需要的，保持芯片小而高效。
- 没有遗留包袱：x86 具有 45 年的向后兼容性。 ARM承载35年。 RISC-V 开始干净，吸收了两者的经验教训。没有仅为了与 20 世纪 80 年代的软件兼容而存在的晦涩说明。
谁使用 RISC-V：SiFive（通用核心）、阿里巴巴（玄铁服务器核心）、西部数据（存储控制器，出货量数十亿）、Espressif（ESP32-C3，流行的 IoT 芯片）以及数十家使用 RISC-V 作为管理其自定义计算单元的控制处理器的 AI 加速器初创公司。

RISC-V基础架构¶

基本整数 ISA（64 位的 RV64I）具有：
- 32 个通用 registers（x0-x31，每个 64 位）。 x0 硬连线为零（对于在没有特殊指令的情况下实现常见模式很有用）。
- 固定 32 位指令宽度（C extension 添加 16 位压缩指令以提高代码密度）。
- 加载-存储架构：与ARM一样，算术运算仅在registers上进行。内存访问是通过显式加载/存储指令进行的。

# RISC-V assembly (for flavour — you will use C/C++)
add  x3, x1, x2      # x3 = x1 + x2
lw   x4, 0(x5)       # load word from address in x5
sw   x4, 8(x5)       # store word to address x5 + 8
beq  x1, x2, label   # branch if x1 == x2

ISA 的简单性使得 RISC-V kernel体积小且节能。最小的 RV32I kernel可在约 10,000 个门中实现（ARM Cortex-M0 约为 12,000 个）。这对于嵌入式系统来说很重要，其中每一毫瓦和每一平方毫米的硅都很重要。

V 扩展：RISC-V 矢量处理¶

V extension (RVV) 在 RISC-V 的基础上添加了可扩展的 vector 处理，类似于 ARM SVE。矢量 registers 具有由硬件指定的可配置长度 (VLEN)（128 至 65,536 位）。代码编写为 vector 长度不可知：它可以在任何 VLEN 上运行，无需重新编译。

#include <riscv_vector.h>

// Vector addition using RVV intrinsics
void vadd_rvv(const float* a, const float* b, float* c, int n) {
    while (n > 0) {
        // vsetvl: set vector length — processes min(n, hardware_max) elements
        size_t vl = __riscv_vsetvl_e32m1(n);

        // Load vl elements
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);

        // Add
        vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);

        // Store
        __riscv_vse32_v_f32m1(c, vc, vl);

        // Advance pointers
        a += vl; b += vl; c += vl; n -= vl;
    }
}

vsetvl是关键指令。它告诉硬件“我想处理这么多元素”，硬件响应“我可以处理这么多元素”（由 VLEN 限制）。该循环自动适应任何 vector 宽度，无需标量清理（最后一次迭代仅处理较少的元素）。
LMUL（长度乘数）：RVV 可以将多个 vector registers 组合在一起（m1、m2、m4、m8），以减少可用 registers 的成本，以每条指令处理更多元素。 m1 每个 vector 操作数使用一个 register； m8 使用 8 个，处理量增加了 8 倍，但仅留下 4 个 register 组可用。
与x86 AVX（固定256/512位）和ARM NEON（固定128位）相比，RVV的可扩展性是针对不同硬件的主要优势：相同的代码运行在微型嵌入式kernel（VLEN=128）和高性能服务器kernel（VLEN=1024+）上。

嵌入式机器学习：TinyML¶

TinyML 是微控制器上的机器学习——具有千字节 RAM、兆赫级 CPU 和毫瓦功率预算的设备。想一想：检测关键词（“Hey Siri”）的传感器、对手势进行分类的加速计或对人数进行统计的摄像头，所有这些都在成本为 0.50 美元的芯片上运行，没有互联网连接。
限制是极端的：

资源	服务器 GPU	手机	微控制器
内存	80GB	6GB	256KB
贮存	结核病	128GB	1MB
计算	1000 万亿次浮点运算	10 万亿次浮点运算	0.001 万亿次浮点运算
力量	700瓦	5瓦	0.001瓦
成本	$30,000	$500	$1

适用于服务器 GPU（$O(10^{10})$ 参数）的 model 不适用于 microcontroller。 TinyML models 具有 $O(10^4)$–$O(10^6)$ 参数，并使用 INT8 甚至 INT4 量化。

TensorFlow Lite Micro (TFLM)¶

TFLM 是 Google 的微控制器 inference 框架。它运行量化的 TensorFlow Lite models，无需动态内存分配，无需操作系统，二进制占用空间约为 20 KB。

// TinyML inference on a microcontroller (simplified)
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Model is compiled into a C array (const unsigned char model_data[])
const tflite::Model* model = tflite::GetModel(model_data);

// Allocate a fixed memory arena (no malloc!)
constexpr int kArenaSize = 10 * 1024;  // 10 KB
uint8_t tensor_arena[kArenaSize];

// Set up interpreter
tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kArenaSize);
interpreter.AllocateTensors();

// Set input
float* input = interpreter.input(0)->data.f;
input[0] = sensor_reading;

// Run inference
interpreter.Invoke();

// Read output
float* output = interpreter.output(0)->data.f;
if (output[0] > 0.8f) {
    trigger_alert();
}

此代码中的关键约束：
- tensor_arena 是静态分配的 — 没有 malloc，没有堆。嵌入式系统通常没有动态内存分配器。
- model 是一个 const 字节数组，存储在闪存 (ROM) 中，而不是从 file system 加载。
- 整个框架+model+运行时只有几十KB。

边缘model优化¶

让 model 在 microcontroller 上运行需要积极的优化：
- 量化（第 18 章）：将 float32 权重转换为 INT8（在纯整数硬件上小 4 倍，快 2-4 倍）。 training 后量化很简单；量化感知 training 保持更高的准确性。
- 修剪：去除接近于零的权重。结构化修剪（删除整个通道/头）比非结构化修剪（随机零）对硬件更友好，因为它减少了实际计算，而不仅仅是存储。
- 知识蒸馏（第6章）：training一个小“学生”model来模仿一个大“老师”model。学生从头开始获得了比 training 更高的准确率，因为它从老师的软预测中学习。
- 神经架构搜索 (NAS)：自动搜索适合硬件预算（延迟、内存、功耗）的高效架构。 MicroNets 和 MCUNet 找到针对特定微控制器优化的架构。
- 算子融合：将 conv + 批量归一化 + ReLU merge为单个融合操作，消除中间内存写入（与 GPU kernel 融合原理相同，但当您拥有 256 KB RAM 时更为关键）。

AI加速器中的RISC-V¶

许多人工智能加速器初创公司使用 RISC-V 并不是为了直接运行 ML models，而是作为管理自定义计算单元的控制处理器：

┌─────────────────────────────────────────┐
│              AI Accelerator             │
│                                         │
│  ┌──────────┐    ┌──────────────────┐   │
│  │  RISC-V  │───→│  Custom Matrix   │   │
│  │  Control │    │  Multiply Unit   │   │
│  │  Core    │    │  (systolic array,│   │
│  │          │    │  custom dataflow)│   │
│  └──────────┘    └──────────────────┘   │
│       │                    │            │
│       ▼                    ▼            │
│  ┌──────────┐    ┌──────────────────┐   │
│  │  Memory  │    │  On-chip SRAM    │   │
│  │  Control │    │  (activation     │   │
│  │          │    │   buffer)        │   │
│  └──────────┘    └──────────────────┘   │
└─────────────────────────────────────────┘

RISC-V kernel处理：从外部存储器加载 model 权重、调度层执行、管理计算单元之间的数据流以及与主机通信（通过 PCIe、USB 或 SPI）。繁重的计算（矩阵乘法、卷积）是由定制硬件完成的，而不是 RISC-V kernel。
为什么选择 RISC-V 进行控制：无许可成本（对于初创公司至关重要）、可定制（添加特定于域的指令）、占用空间小（控制核心不需要 x86 的复杂性），并且开放的生态系统可实现快速原型设计。
示例：Esperanto Technologies（用于 ML 的 1000 多个 RISC-V 核心）、Tenstorrent（RISC-V 控制 + 自定义 Tensix 核心）、SiFive（RISC-V 核心和用于边缘 ML 的 vector extensions）。

边缘部署限制¶

在边缘（设备上，而不是云中）部署 ML 会带来云部署不会面临的限制：
功率：电池供电设备的总功率预算可能为 100 mW。运行功耗为 50 mW 的 model 时，系统的其余部分（传感器、无线电、显示器）仅剩下 50 mW。功率感知 inference 安排计算以避免热节流并延长电池寿命。
延迟：边缘 inference 通常必须是实时的。唤醒词检测器（“Hey Siri”）必须在约 200 毫秒内做出响应。自动驾驶感知系统（第 11 章）必须在约 30 毫秒内处理帧。对于这些用例来说，到云的网络往返（50-200 毫秒）太慢。
隐私：在设备上处理数据意味着敏感数据（医学图像、录音、个人照片）永远不会离开设备。这是某些司法管辖区 (GDPR) 的法律要求，也是所有地方的用户信任要求。
连接：边缘设备可能具有间歇性或无互联网连接。在火星探测器（第 11 章）、潜艇或农村农场传感器上运行的 model 必须完全离线工作。
大规模成本：将 ML 部署到 10 亿部智能手机，每台设备的成本为 0 美元（硬件已存在）。部署 10 亿个 IoT 传感器意味着每个传感器的 ML 硬件预算仅为几美分。 RISC-V 的零许可成本在这种规模下非常重要。

编码任务（使用 g++ 或 riscv64-gcc 交叉编译器编译）¶

编写一个模拟 TinyML inference 管道的 C 程序：静态分配 model 缓冲区，运行模拟前向传递，并测量资源使用情况。这教导了嵌入式约束（无 malloc、固定内存区域）。

// task1_tinyml_sim.cpp
// Compile: g++ -O2 -o task1 task1_tinyml_sim.cpp

#include <iostream>
#include <chrono>
#include <cmath>
#include <cstring>

// Simulate a microcontroller: fixed memory arena, no dynamic allocation
static constexpr int ARENA_SIZE = 32 * 1024;  // 32 KB total RAM budget
static uint8_t arena[ARENA_SIZE];

// Simple 2-layer MLP: 784 -> 64 -> 10 (MNIST-like, INT8 weights)
struct TinyModel {
    int8_t w1[784 * 64];      // layer 1 weights: 50,176 bytes
    int8_t b1[64];             // layer 1 biases
    int8_t w2[64 * 10];       // layer 2 weights: 640 bytes
    int8_t b2[10];             // layer 2 biases
    // Total: ~51 KB → must go in flash (ROM), not RAM
};

// Check if model fits in flash
void check_model_fit(int flash_kb) {
    int model_bytes = sizeof(TinyModel);
    std::cout << "Model size: " << model_bytes << " bytes ("
              << model_bytes / 1024 << " KB)\n";
    std::cout << "Flash: " << flash_kb << " KB → "
              << (model_bytes <= flash_kb * 1024 ? "FITS" : "TOO LARGE") << "\n";
}

// Mock inference using the fixed arena for activations
void mock_inference(const int8_t* input, int8_t* output) {
    // Activations go in the arena (RAM), not allocated dynamically
    int8_t* act1 = (int8_t*)arena;            // 64 bytes for layer 1 output
    int8_t* act2 = (int8_t*)(arena + 64);     // 10 bytes for layer 2 output

    // Layer 1: simplified matmul (not real quantised matmul, just structure demo)
    for (int j = 0; j < 64; j++) {
        int32_t sum = 0;  // accumulate in int32 to avoid overflow
        for (int i = 0; i < 784; i++) {
            sum += (int32_t)input[i] * 1;  // mock: weight = 1
        }
        act1[j] = (int8_t)std::max(-128, std::min(127, sum / 784));  // quantise back
        act1[j] = act1[j] > 0 ? act1[j] : 0;  // ReLU
    }

    // Layer 2
    for (int j = 0; j < 10; j++) {
        int32_t sum = 0;
        for (int i = 0; i < 64; i++) {
            sum += (int32_t)act1[i] * 1;
        }
        act2[j] = (int8_t)std::max(-128, std::min(127, sum / 64));
    }

    std::memcpy(output, act2, 10);
}

int main() {
    std::cout << "=== TinyML Resource Budget ===\n";
    std::cout << "Arena (RAM): " << ARENA_SIZE << " bytes ("
              << ARENA_SIZE / 1024 << " KB)\n";
    check_model_fit(256);  // typical MCU flash

    // Activation memory used
    int activation_bytes = 64 + 10;  // layer 1 + layer 2 outputs
    std::cout << "Activation memory: " << activation_bytes
              << " bytes / " << ARENA_SIZE << " available\n\n";

    // Benchmark inference
    int8_t input[784];
    int8_t output[10];
    std::memset(input, 1, 784);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; i++) {
        mock_inference(input, output);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double us = std::chrono::duration<double, std::micro>(end - start).count() / 10000;

    std::cout << "Inference latency: " << us << " us\n";
    std::cout << "At 160 MHz MCU (~6.25 ns/cycle): ~"
              << (int)(us * 160) << " cycles\n";

    std::cout << "Output logits: ";
    for (int i = 0; i < 10; i++) std::cout << (int)output[i] << " ";
    std::cout << "\n";

    return 0;
}

编写一个 C++ 程序，将 float32 权重量化为 INT8，并测量压缩比和量化误差。

// task2_quantise.cpp
// Compile: g++ -O3 -o task2 task2_quantise.cpp

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <numeric>

// Symmetric quantisation: map float range [-max, +max] to [-127, +127]
void quantise_symmetric(const float* input, int8_t* output, int n, float& scale) {
    float max_val = 0.0f;
    for (int i = 0; i < n; i++) {
        max_val = std::max(max_val, std::abs(input[i]));
    }
    scale = max_val / 127.0f;
    for (int i = 0; i < n; i++) {
        float scaled = input[i] / scale;
        output[i] = (int8_t)std::max(-127.0f, std::min(127.0f, std::round(scaled)));
    }
}

// Dequantise: INT8 back to float
void dequantise(const int8_t* input, float* output, int n, float scale) {
    for (int i = 0; i < n; i++) {
        output[i] = (float)input[i] * scale;
    }
}

int main() {
    const int N = 100000;

    // Simulate random weights (roughly normal distribution)
    std::vector<float> weights(N);
    for (int i = 0; i < N; i++) {
        // Simple pseudo-random normal-ish values
        float u1 = (float)(i * 7 % 997 + 1) / 998.0f;
        float u2 = (float)(i * 13 % 991 + 1) / 992.0f;
        weights[i] = std::sqrt(-2.0f * std::log(u1)) * std::cos(6.2832f * u2) * 0.1f;
    }

    // Quantise
    std::vector<int8_t> quantised(N);
    float scale;
    quantise_symmetric(weights.data(), quantised.data(), N, scale);

    // Dequantise and measure error
    std::vector<float> reconstructed(N);
    dequantise(quantised.data(), reconstructed.data(), N, scale);

    float max_error = 0.0f, total_error = 0.0f;
    for (int i = 0; i < N; i++) {
        float err = std::abs(weights[i] - reconstructed[i]);
        max_error = std::max(max_error, err);
        total_error += err;
    }

    std::cout << "=== Quantisation Results ===\n";
    std::cout << "Original:    " << N * 4 << " bytes (float32)\n";
    std::cout << "Quantised:   " << N * 1 << " bytes (int8) + 4 bytes (scale)\n";
    std::cout << "Compression: " << 4.0f << "x\n";
    std::cout << "Scale factor: " << scale << "\n";
    std::cout << "Mean abs error: " << total_error / N << "\n";
    std::cout << "Max abs error:  " << max_error << "\n";
    std::cout << "Max abs error / scale: " << max_error / scale
              << " (should be <= 0.5 quantisation levels)\n";

    return 0;
}

编写一个 C++ 程序，执行 INT8 矩阵乘法和 INT32 累加——在嵌入式 ML 加速器上运行的实际计算。

// task3_int8_matmul.cpp
// Compile: g++ -O3 -o task3 task3_int8_matmul.cpp

#include <iostream>
#include <chrono>
#include <vector>
#include <cstdint>

// INT8 matmul with INT32 accumulation (what Tensor Cores and MCU accelerators do)
void matmul_int8(const int8_t* A, const int8_t* B, int32_t* C,
                 int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int32_t sum = 0;
            for (int k = 0; k < K; k++) {
                sum += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

// Float32 matmul for comparison
void matmul_f32(const float* A, const float* B, float* C,
                int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

int main() {
    const int M = 128, N = 128, K = 128;

    std::vector<int8_t> A_i8(M * K, 1), B_i8(K * N, 1);
    std::vector<int32_t> C_i32(M * N);

    std::vector<float> A_f32(M * K, 1.0f), B_f32(K * N, 1.0f);
    std::vector<float> C_f32(M * N);

    // Benchmark INT8
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        matmul_int8(A_i8.data(), B_i8.data(), C_i32.data(), M, N, K);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double i8_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;

    // Benchmark FP32
    start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        matmul_f32(A_f32.data(), B_f32.data(), C_f32.data(), M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    double f32_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;

    double gflops_i8 = 2.0 * M * N * K / i8_ms / 1e6;
    double gflops_f32 = 2.0 * M * N * K / f32_ms / 1e6;

    std::cout << "INT8 matmul:  " << i8_ms << " ms (" << gflops_i8 << " GOPS)\n";
    std::cout << "FP32 matmul:  " << f32_ms << " ms (" << gflops_f32 << " GFLOPS)\n";
    std::cout << "INT8 speedup: " << f32_ms / i8_ms << "x\n";
    std::cout << "Memory: INT8 = " << M*K + K*N << " bytes vs FP32 = "
              << (M*K + K*N) * 4 << " bytes (4x less)\n";

    return 0;
}