GPU架构和CUDA¶

GPU 通过提供数千个kernel来实现大规模并行性，从而改变了人工智能。该文件涵盖了 GPU 与 CPU 设计理念、GPU 内存层次结构、C++ 中的 CUDA 编程、SIMT 执行 model、内存访问模式、synchronisation、流、profiling 和 NVIDIA GPU几代产品，编写和理解 GPU kernels 所需的知识。

有关包含完整工作示例的 CUDA 实践教程，请参阅配套的 repository：github.com/HenryNdubuaku/cuda-tutorials。
现代 NVIDIA GPU 拥有超过 10,000 个 CUDA kernel。 CPU有4-128个核心。这种 100-1000 倍的核心优势是 GPU 主导 ML 的原因：training 和 transformer 需要数万亿次乘加运算，而 GPU 并行处理这些运算的规模是 CPU 无法比拟的。
即使您自己从未编写过 CUDA kernels，理解 GPU 架构也会解释：为什么批量大小很重要（需要足够的工作来使 GPU 饱和），为什么内存通常是瓶颈（而不是计算），以及为什么某些操作（分散、条件branch）在 GPU 上速度很慢。

GPU 与 CPU：根本不同的设计¶

CPU 专为延迟而设计：最大限度地缩短完成一项任务的时间。它将大部分晶体管预算用于缓存、branch 预测器和乱序执行——所有这些技巧都使 thread 更快。
GPU 专为吞吐量而设计：最大化每秒完成的任务数量。它将大部分晶体管用于执行单元 (ALU)。单个threads很慢，但有数千个。

	CPU	GPU
核心	4-128（复杂、快速）	1,000-20,000（简单，慢）
时钟速度	3-5GHz	1-2.5GHz
缓存	大（32 MB+ L3）	小型（每 SM shared memory）
branch预测	复杂的	无（所有 threads 都遵循相同的路径）
最适合	低延迟、复杂的控制流程	高吞吐量、数据并行工作
典型浮点运算次数 (FP32)	1-5 万亿次浮点运算	30-80 万亿次浮点运算
内存带宽	50-100GB/秒	1-3TB/秒

GPU 的内存带宽优势（10-30 倍）通常比其计算优势更重要。许多 ML 操作都是受内存限制的（逐元素操作、归一化、attention），而 GPU 的带宽使其能够足够快地向其核心提供数据。

GPU 内存层次结构¶

了解 GPU 内存至关重要，因为内存访问是主要瓶颈，而不是计算。

记忆	尺寸	延迟	带宽	范围
寄存器	每个 SM 约 256 KB	0 周期	最高	每 thread
shared memory	每个 SM 48-228 KB	〜5个周期	约 20 TB/秒	每 thread block
L1 cache	每个 SM 128-256 KB	〜30个周期		每SM
L2 cache	4-96MB	约 200 个周期	~6TB/秒	全球的
global memory（HBM）	24-192GB	约 400 个周期	1-3.3TB/秒	全球的

寄存器是最快但最有限的。每个 thread 都有一组专用的 registers（通常最多 255 个）。每个 thread 使用过多的 registers 会减少 occupancy（可以同时运行的 threads 更少）。
shared memory是程序员管理的缓存，由 block 中的所有 threads 共享。这是编写快速 CUDA kernels 的关键：将一block数据从慢速 global memory 加载到快速 shared memory，然后对其进行计算。这是主导 GPU 编程的 tiling 模式。
global memory (HBM)：main GPU 内存 (VRAM)。大但慢（400 个周期延迟）。所有数据都从这里开始和结束。 kernel 优化的目标是最小化 global memory 访问。

CUDA 编程model¶

CUDA（计算统一设备架构）是NVIDIA针对GPU的编程model。您编写kernels：在GPU上运行的函数，由数千个threads同时执行。

层次结构：grid、block、thread¶

Grid (the entire launch)
├── Block (0,0)
│   ├── Thread (0,0)
│   ├── Thread (1,0)
│   ├── Thread (2,0)
│   └── ... (up to 1024 threads per block)
├── Block (1,0)
│   ├── Thread (0,0)
│   └── ...
└── ... (millions of blocks possible)

thread：最小单位。每个 thread 在其 block 内都有一个唯一的 ID (threadIdx.x)。
block：一组可以shared memory并synchronisation的threads。区blockID：blockIdx.x。block大小：blockDim.x（最多1024 threads）。
grid：所有blocks均由单个kernel发起。可以是 1D、2D 或 3D。
每个thread计算其全局索引：int idx = blockIdx.x * blockDim.x + threadIdx.x;

您的第一个 CUDA kernel¶

// vector_add.cu — CUDA source file (.cu extension)

#include <stdio.h>

// __global__ marks this as a GPU kernel (called from CPU, runs on GPU)
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {           // bounds check (grid may be larger than data)
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // ~1 million elements
    size_t bytes = n * sizeof(float);

    // Allocate host (CPU) memory
    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];

    // Initialise
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Allocate device (GPU) memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from CPU to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads per block, enough blocks to cover n elements
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;  // ceiling division
    vector_add<<<grid_size, block_size>>>(d_a, d_b, d_c, n);

    // Copy result from GPU to CPU
    cudaMemcpy(h_c, d_a, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;

    return 0;
}

# Compile with NVIDIA's compiler
nvcc -O3 -o vector_add vector_add.cu
./vector_add

CUDA 中的关键 C++ 概念：
- __global__：标记kernel函数的CUDA关键字。从 CPU (host) 调用，在 GPU (device) 上运行。
- <<<grid_size, block_size>>>：kernel 启动语法。指定要使用的 blocks 和 threads 数量。
- cudaMalloc / cudaFree：分配/释放 GPU 内存（类似于 new/delete，但适用于 GPU）。
- cudaMemcpy：CPU和GPU之间复制数据。这通常是最大的瓶颈（PCIe 带宽约为 32 GB/s，而 GPU 内存带宽约为 3 TB/s）。

扭曲和 SIMT¶

GPU 以 32 个为一组执行 threads，称为 warps。 warp 中的所有 32 个 threads 同时执行相同的指令（单指令，多thread - SIMT）。这是 GPU 与 SIMD 的等效项，但处于 thread 级别。
当同一 warp 中的 threads 采用 if 语句的不同 branches 时，就会发生 扭曲发散。 GPU 不能在一个 warp 中同时执行两条不同的指令，因此它会顺序执行两条 branches，屏蔽掉不应该参与的 threads。这会使性能减半（或更糟）。

// BAD: warp divergence (threads in same warp take different paths)
if (threadIdx.x % 2 == 0) {
    c[idx] = a[idx] + b[idx];    // even threads do this
} else {
    c[idx] = a[idx] - b[idx];    // odd threads do this (same warp, serialised)
}

// BETTER: branchless (no divergence)
float sign = (threadIdx.x % 2 == 0) ? 1.0f : -1.0f;
c[idx] = a[idx] + sign * b[idx];  // all threads execute the same instruction

内存merge¶

merge访问：当连续的 threads 访问连续的内存地址时，GPU 将它们组合成单个内存事务。这对于性能至关重要。

// GOOD: coalesced — thread 0 reads a[0], thread 1 reads a[1], ...
c[idx] = a[idx] + b[idx];

// BAD: strided — thread 0 reads a[0], thread 1 reads a[stride], ...
c[idx] = a[idx * stride] + b[idx * stride];  // stride > 1 wastes bandwidth

对于 32 threads 的 warp，merge访问在一次事务中加载 128 字节（对于 float32，为 32 × 4 字节）。跨步访问需要多个事务，每个事务加载 128 个字节，但仅使用一小部分。步幅为 32 是最坏的情况：每个事务加载 128 个字节，但只有一个 thread 使用 4 个字节（利用率为 3%）。

shared memory和平铺¶

tiling 模式是最重要的 GPU 优化技术。想法：将 block 数据从慢速 global memory 加载到快速 shared memory 中，对其进行计算，然后将结果写回。

// Matrix multiply with shared memory tiling (simplified)
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    // Shared memory for one tile of A and one tile of B
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load one tile of A and B into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            tile_A[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            tile_B[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // wait for all threads to finish loading

        // Compute partial dot product from this tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }

        __syncthreads();  // wait before loading the next tile
    }

    if (row < M && col < N)
        C[row * N + col] = sum;
}

__shared__：声明block中所有threads共享的内存（快速，片上）。
__syncthreads()：一个屏障，等待block中的所有threads都到达这一点。在写入 shared memory 和从中读取之间需要（否则某些 threads 会读取陈旧数据）。
为什么 tiling 有效：没有它，每个 thread 都会从 global memory 加载每次乘法。对于 tiling，TILE_SIZE × TILE_SIZE block 数据一次加载到 shared memory 中，并由 block 中的所有 threads 重复使用。重用因子为 TILE_SIZE，按该因子减少 global memory 流量。

流和并发¶

默认情况下，CUDA 操作是顺序的：CPU 启动 kernel，等待其完成，然后启动下一个。流启用重叠：

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// These operations can overlap: different streams execute concurrently
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);

kernel1<<<grid, block, 0, stream1>>>(d_a, d_c);
kernel2<<<grid, block, 0, stream2>>>(d_b, d_d);

流将数据传输与计算重叠：当一个流的 kernel 运行时，另一个流复制数据。这隐藏了 PCIe 传输延迟并使 GPU 保持忙碌。

分析 CUDA 代码¶

# NVIDIA Nsight Compute: kernel-level profiling
ncu --set full ./my_program

# NVIDIA Nsight Systems: system-level timeline
nsys profile ./my_program

# Quick metrics
ncu --metrics sm__throughput,dram__throughput ./my_program

要寻找什么：
- 占用率：已使用的 SM 容量的比例。低 occupancy (< 50%) 意味着 threads 太少，无法隐藏内存延迟。原因：每个 thread 的 registers 太多，每个 block 的 shared memory 太多。
- 内存吞吐量：与峰值带宽进行比较。如果达到峰值的 50% 以下，则内存访问模式效率低下（非merge、组conflict）。
- 计算吞吐量：与峰值 FLOPS 进行比较。如果内存和计算吞吐量都很低，则 kernel 会受到延迟限制（并行度不够）。

先进的优化技术¶

除了merge和 shared memory tiling 的基础知识之外，高性能 GPU（和 CPU）代码还使用了多种高级技术：

数据布局：AoS 与 SoA¶

结构数组 (AoS)：每个元素将其所有字段存储在一起。 [{x,y,z}, {x,y,z}, {x,y,z}]。
数组结构 (SoA)：每个字段都存储在自己的连续数组中。 {[x,x,x], [y,y,y], [z,z,z]}。

// AoS: BAD for SIMD/GPU (accessing all x values touches non-contiguous memory)
struct Particle { float x, y, z, mass; };
Particle particles[N];
// particles[0].x, particles[1].x are 16 bytes apart

// SoA: GOOD for SIMD/GPU (all x values are contiguous)
struct Particles {
    float x[N], y[N], z[N], mass[N];
};
// x[0], x[1] are 4 bytes apart — perfect for coalesced access and SIMD

对于数据并行工作负载（SIMD、GPU），SoA 几乎总是更快。当您始终一起访问一个元素的所有字段时（在数字代码中很少见），AoS 会更好。 PyTorch 张量本质上是 SoA：每个特征都是一个连续的维度。

软件预取¶

可以告诉 CPU 在需要数据之前开始加载数据，隐藏内存延迟：

#include <xmmintrin.h>  // for _mm_prefetch

for (int i = 0; i < n; i += 4) {
    _mm_prefetch((char*)(a + i + 64), _MM_HINT_T0);  // prefetch 64 elements ahead
    // process a[i:i+4] with SIMD
    __m128 va = _mm_load_ps(a + i);
    // ...
}

预取指令是一个提示：如果数据已经在缓存中，则它是一个空操作。如果没有，CPU 开始在后台获取它，同时执行其他指令。应调整预取距离（本例中为前面 64 个元素）以匹配内存延迟和循环迭代时间。

kernel融合¶

kernel融合 将多个操作merge到单个 kernel 中，以避免将中间结果写入内存。这是对 ML 最有影响力的 GPU 优化：

// UNFUSED: 3 kernel launches, 3 global memory round-trips
y = matmul(x, W)     // write y to global memory
z = y + bias          // read y, write z
out = relu(z)         // read z, write out

// FUSED: 1 kernel launch, 1 global memory write
out = fused_matmul_bias_relu(x, W, bias)  // y and z never leave SRAM

对于内存限制操作（偏置添加、ReLU、层规范），内存流量主导执行时间。熔断完全消除了流量。 PyTorch 的 torch.compile 和 Triton 可以自动或轻松地实现融合。

混合精度kernel¶

使用较低精度（FP16、BF16、INT8）进行计算并使用较高精度（FP32）进行累加可以实现两全其美：

// Tensor Core: multiply FP16 matrices, accumulate in FP32
// Each Tensor Core instruction: D (FP32) = A (FP16) × B (FP16) + C (FP32)
nvcuda::wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

FP16 比 FP32 小 2 倍，因此它使内存带宽（通常的瓶颈）加倍，并在缓存中容纳 2 倍多的数据。 Tensor 核心处理 FP16 的速度是 FP32 CUDA 核心的 8-16 倍。这就是为什么混合精度 training（第 6 章）提供 2-3 倍加速且精度损失最小的原因。

内存池分配器¶

cudaMalloc 很慢（每次调用约 1 毫秒），因为它与 GPU synchronisation。在每次迭代分配临时缓冲区的 training 循环中，这会累加。
内存池（PyTorch 的缓存分配器，CUDA 内存池）预先分配 GPU 内存的大 block 并从中进行子分配，无需系统调用：

# PyTorch does this automatically — but understanding why matters
# Each torch.empty() reuses memory from the pool, no cudaMalloc
temp = torch.empty(1024, 1024, device='cuda')  # microseconds, not milliseconds

这就是 PyTorch 的 torch.cuda.memory_allocated() 和 torch.cuda.max_memory_allocated() 不同的原因：分配的是当前正在使用的内容，最大值是峰值（池可能容纳比当前使用的更多的内容）。

配置文件引导优化¶

不要盲目优化。 首先分析，识别瓶颈，优化它，然后重新分析。屋顶线 model（文件 01）告诉您瓶颈是内存还是计算：
- 内存限制（低算术强度）：优化数据布局（SoA），熔断kernels，使用较低精度，预取。
- 计算密集（高算术强度）：使用 Tensor Core，增加并行性，使用更快的指令 (FMA)。
- 延迟限制（并行性不足）：增加 occupancy，减少 register 使用量，启动更多 threads。
大多数机器学习工作负载都是内存限制。令人惊讶的含义是：更快的 GPU（更多 FLOPS）通常没有帮助。更快的内存（HBM3 与 HBM2e）有更多帮助。这就是为什么 A100→H100 升级不仅仅是 FLOPS——H100 还具有 2 倍的内存带宽。

NVIDIA GPU 世代¶

一代	年	关键创新	人工智能相关性
帕斯卡 (P100)	2016	HBM2、NVLink	第一次认真深度学习GPU
沃尔特 (V100)	2017	张量核心（混合精度 matmul）	启用 FP16 training，125 TFLOPS TF32
安培 (A100)	2020	TF32，稀疏性，第三代张量核心	312 TFLOPS TF32，结构稀疏度 2:4
料斗 (H100)	2022	变压器引擎 (FP8)，HBM3	989 TFLOPS FP8，动态精准切换
布莱克威尔 (B200)	2024	第二代变压器引擎，NVLink 5	2.5 PFLOPS FP4，多芯片设计

张量核心是专门的矩阵乘法单元。一条 Tensor Core 指令在一个周期内计算 4×4 矩阵乘法 (D = A×B + C)。常规 CUDA kernel需要 64 个 FMA 操作。 Tensor Core 是混合精度 training（float16 计算、float32 累加）速度快的原因。
变压器引擎 (Hopper+) 在单层内动态切换 FP8 和 FP16 精度，仅在需要时选择更高的精度。这样可以在不牺牲 model 质量的情况下最大限度地提高吞吐量。它专为主导现代人工智能的 transformer 架构（attention + MLP）而设计。

编码任务（使用 nvcc 编译）¶

编写一个 CUDA kernel，将 ReLU 应用于数组。测量时间，包括内存传输。这教导了 kernel 编写、cudaMalloc/cudaMemcpy 和主机↔设备传输瓶颈。

// task1_relu.cu
// Compile: nvcc -O3 -o task1_relu task1_relu.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void relu_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] > 0.0f ? input[idx] : 0.0f;
    }
}

int main() {
    const int N = 1 << 24;  // ~16M elements
    size_t bytes = N * sizeof(float);

    // Allocate host memory
    float* h_input  = (float*)malloc(bytes);
    float* h_output = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) {
        h_input[i] = (float)(i % 100) - 50.0f;  // mix of positive and negative
    }

    // Allocate device memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);

    // Time the full pipeline: copy to GPU, compute, copy back
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);

    int block_size = 256;
    int grid_size = (N + block_size - 1) / block_size;
    relu_kernel<<<grid_size, block_size>>>(d_input, d_output, N);

    cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);

    // Verify
    int errors = 0;
    for (int i = 0; i < N; i++) {
        float expected = h_input[i] > 0.0f ? h_input[i] : 0.0f;
        if (h_output[i] != expected) errors++;
    }

    printf("Time (including transfers): %.2f ms\n", ms);
    printf("Bandwidth: %.1f GB/s\n", 2.0 * bytes / ms / 1e6);  // read + write
    printf("Errors: %d / %d\n", errors, N);

    cudaFree(d_input); cudaFree(d_output);
    free(h_input); free(h_output);
    return 0;
}

使用 shared memory 在 CUDA 中编写平铺矩阵乘法。将性能与原始（非平铺）版本进行比较。这教导了 shared memory、__syncthreads，以及为什么 tiling 很重要。

// task2_matmul.cu
// Compile: nvcc -O3 -o task2_matmul task2_matmul.cu

#include <stdio.h>
#include <cuda_runtime.h>

#define TILE 16

// Naive matmul: each thread computes one element of C
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Tiled matmul: use shared memory to reduce global memory accesses
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        sA[threadIdx.y][threadIdx.x] = (row < N && t*TILE+threadIdx.x < N)
            ? A[row * N + t*TILE + threadIdx.x] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (t*TILE+threadIdx.y < N && col < N)
            ? B[(t*TILE + threadIdx.y) * N + col] : 0.0f;

        __syncthreads();
        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

int main() {
    const int N = 1024;
    size_t bytes = N * N * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);

    // Initialise with ones (easy to verify: C should be all N)
    float* h_A = new float[N*N];
    for (int i = 0; i < N*N; i++) h_A[i] = 1.0f;
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_A, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N+TILE-1)/TILE, (N+TILE-1)/TILE);

    // Benchmark naive
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; i++)
        matmul_naive<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float naive_ms; cudaEventElapsedTime(&naive_ms, start, stop);

    // Benchmark tiled
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++)
        matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tiled_ms; cudaEventElapsedTime(&tiled_ms, start, stop);

    double gflops_naive = 2.0 * N * N * N * 10 / naive_ms / 1e6;
    double gflops_tiled = 2.0 * N * N * N * 10 / tiled_ms / 1e6;

    printf("Naive:  %.2f ms, %.1f GFLOPS\n", naive_ms/10, gflops_naive);
    printf("Tiled:  %.2f ms, %.1f GFLOPS\n", tiled_ms/10, gflops_tiled);
    printf("Speedup: %.1fx\n", naive_ms / tiled_ms);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A;
    return 0;
}

演示 warp 散度。编写一个 kernel，其中同一 warp 中的 threads 采用不同的 branches，并与无branch版本进行比较。

// task3_divergence.cu
// Compile: nvcc -O3 -o task3_diverge task3_divergence.cu

#include <stdio.h>
#include <cuda_runtime.h>

// BAD: warp divergence — even/odd threads take different paths
__global__ void divergent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (idx % 2 == 0) {
            data[idx] = data[idx] * 2.0f + 1.0f;
        } else {
            data[idx] = data[idx] * 0.5f - 1.0f;
        }
    }
}

// GOOD: branchless — all threads execute the same instruction
__global__ void branchless_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float scale = (idx % 2 == 0) ? 2.0f : 0.5f;
        float offset = (idx % 2 == 0) ? 1.0f : -1.0f;
        data[idx] = data[idx] * scale + offset;
    }
}

int main() {
    const int N = 1 << 24;
    float* d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    int block = 256, grid = (N + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    // Divergent
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        divergent_kernel<<<grid, block>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float div_ms; cudaEventElapsedTime(&div_ms, start, stop);

    // Branchless
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        branchless_kernel<<<grid, block>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float nodiv_ms; cudaEventElapsedTime(&nodiv_ms, start, stop);

    printf("Divergent:  %.2f ms\n", div_ms / 100);
    printf("Branchless: %.2f ms\n", nodiv_ms / 100);
    printf("Speedup:    %.2fx\n", div_ms / nodiv_ms);

    cudaFree(d_data);
    return 0;
}