ML 设计示例¶

学习 ML systems design 的最好方式是看完整案例。本文走读 7 个完整设计：recommendation systems、search ranking、ads click prediction、fraud detection、content moderation、conversational AI，以及 large-scale image search。

每个案例都遵循同一框架：
1. Problem framing：要构建什么、用户是谁、约束是什么？
2. Data：有哪些数据、怎么采集、如何标注？
3. Features：model 需要哪些 features？
4. Model：用什么 architecture 与 training 方案？
5. Serving：model 如何部署与服务？
6. Evaluation：如何衡量是否成功？
7. Iteration：后续如何持续改进？

1. Recommendation System（如 YouTube、Netflix、Spotify）¶

Problem Framing¶

Goal：向用户展示他们会喜欢的内容，最大化 engagement（watch time、listens、clicks）。
Scale：1B+ users、100M+ items、每秒 10K+ 推荐请求。
Latency：完整 recommendation pipeline <200ms。
核心挑战：candidate 空间极大（100M items），无法实时为所有用户给所有 items 打分。

Architecture：Two-Stage Pipeline¶

Recommendation pipeline: 100M items narrowed to 1000 by candidate generation, ranked to 100, re-ranked to 20 shown items

100M items → Candidate Generation (fast, coarse) → 1000 candidates
          → Ranking (slow, precise) → 100 ranked items
          → Re-ranking (business rules) → 20 shown to user

Candidate Generation¶

Goal：把 100M items 缩小到约 1000 个候选，必须足够快（<50ms）。
Two-tower model：将 user 与 item 编码到同一 embedding space。user embedding 表示偏好，item embedding 表示内容特征，score = 两者 dot product。
Training：在 (user, positive_item, negative_items) 三元组上做 contrastive learning。positive 是用户有过 engagement 的 item；negative 是随机 item + hard negatives（热门但用户没互动过的 item）。
Serving：预先计算全部 item embeddings。请求到来时计算 user embedding，再用 ANN search（vector database 中的 HNSW）找最近的 1000 个 item embeddings。

Ranking¶

Goal：精细打分这 1000 个候选，预算约 100ms。
Model：deep neural network（MLP 或 transformer），输入丰富 features：user features（demographics、history、context）、item features（content、popularity、freshness）和 cross features（user-item interaction history、contextual relevance）。
Output：预测 engagement 概率（click、watch 50%+、like、share）。可做多目标加权：$\text{score} = w_1 \cdot P(\text{click}) + w_2 \cdot P(\text{watch}) + w_3 \cdot P(\text{like})$。

Re-ranking¶

应用 business rules：diversity（避免同一创作者连刷 5 条）、freshness（提升新内容）、safety（过滤风险内容）、personalised exploration（插入部分低排位但有潜力内容）。

Back-of-Envelope Numbers¶

Item embedding index：100M × 256-dim × float16 = 50 GB；HNSW 约有 2x 额外开销，总计约 100 GB。可放在 128 GB RAM 单机，或拆到 4 台 32 GB 机器。
User embedding computation：每个 user 约 5ms（小型 MLP）。10K QPS 下约需 50 个 model replicas。
ANN search：100M 向量上用 HNSW 取 top-1000 约 2ms。10K QPS 时每个 index replica 承担约 500 QPS，约需 20 replicas。
Ranking model：1000 candidates × 每个约 0.1ms = 每请求约 100ms。10K QPS 约需每秒 1000 GPU-seconds，即仅 ranking 约 10 张 A10G。
总体基础设施：约 20 个 embedding index replicas + 50 个 user embedding servers + 10 张 ranking GPUs + caching + load balancers。云上成本约 $50K-$100K/月。

Cold Start¶

New users（无历史）：先用 demographics、device/location context 和 popularity-based 推荐；5-10 次交互后切到 personalised model。
New items（无 engagement data）：先用 content-based features（title、description、thumbnail embeddings）。预留 exploration budget，把新 item 展示给一部分用户快速收集数据；boost 期后仍无 engagement 则降权。
Cold start 是系统问题：feature store 必须优雅处理缺失特征（返回默认值而非报错）；model 也应在训练中见过缺失特征（例如对 user history features 做 dropout 模拟新用户）。

Evaluation¶

Offline：held-out set 上看 NDCG（Normalized Discounted Cumulative Gain）、recall@K、precision@K。
Online：A/B test 观察 watch time、DAU、retention。需有周级长周期实验，以捕捉短测看不到的长期留存影响。

2. Search Ranking（如 Google、Bing）¶

Problem Framing¶

Goal：给定 user query，从十亿级 document 语料中返回最相关结果。
Latency：总计 <500ms（100ms retrieval + 200ms ranking + 100ms rendering + 其余开销）。

Architecture：Query Understanding → Retrieve → Rank¶

Query Understanding¶

在 retrieval 之前，先处理 raw query 以提高结果质量：
Spell correction：如 "reccomendation systm" → "recommendation system"。可用 edit-distance model 或在搜索日志 (misspelled, corrected) 样本上训练 sequence-to-sequence model。
Query expansion：补充相关词提高 recall。如 "Python ML" → "Python machine learning scikit-learn pytorch"。可用 synonym dictionary、word embeddings 或 LLM 生成扩展词。
Intent classification：识别用户意图。"buy Nike shoes" 属于 transactional（应展示商品页）；"How does backpropagation work" 属于 informational（应展示文章）；"facebook.com" 属于 navigational（应直接导航）。不同意图应触发不同 retrieval 策略与结果样式。
Entity recognition：从 query 抽取实体。如 "best restaurants near Times Square" 可识别 location="Times Square"、entity type="restaurants"，再路由到 location-aware 搜索链路。

Retrieval¶

BM25（传统）：基于 inverted index 的 term matching。速度快、关键词检索有效，但缺乏 semantic understanding（"dog food" 不一定匹配 "canine nutrition"）。
Dense retrieval：将 query 与 document 编码成 embeddings（如 DPR、ColBERT 这类 bi-encoder），再做 ANN search，可捕捉语义相似（"dog food" 匹配 "canine nutrition"）。比 BM25 慢，但对自然语言查询更好。
Hybrid retrieval：组合 BM25 与 dense retrieval。BM25 找精确关键词，dense 找语义匹配，再 merge 去重，兼顾两者优势。

Ranking¶

Learning to rank：对每个 (query, document) 对打分。常见三类：
- Pointwise：独立预测每个 document 的 relevance score，简单但忽略相对顺序。
- Pairwise：学习两个 document 谁更相关，经典方案是 LambdaMART（gradient-boosted trees）。
- Listwise：直接优化整个排序列表的指标（如 NDCG），更复杂但效果通常最好。
Cross-encoder：将 [query, document] 联合输入 transformer 输出 relevance score。因能捕捉细粒度交互，精度高于 bi-encoder（后者 query/document 分开编码）；但太慢，不能对全量语料使用，通常只对 retrieval 后 top 100-1000 做 re-ranking。

Features¶

Query features：query length、language、intent classification（navigational/informational/transactional）。
Document features：PageRank、freshness、content quality score、domain authority。
Query-document features：BM25 score、embedding similarity、exact match count、历史日志中该 (query, document) 对的 click-through rate。

3. Ads Click Prediction¶

Problem Framing¶

Goal：预测用户点击 ad 的概率，用于 real-time auction 的出价计算。
Scale：每秒 100K+ auctions，每次预测预算 10ms 内。
Revenue impact：click prediction accuracy 提升 0.1%，可能对应数百万级新增收入。

Architecture¶

Feature engineering 是 ads system 的核心，典型包括：
- User features：demographics、browsing history、purchase history、device、location、time of day。
- Ad features：creative（image/text）、advertiser、category、historical CTR、bid amount。
- Context features：page content、ad position、device type、connection speed。
- Cross features：user_category × ad_category、user_region × ad_campaign 等交互特征。
Model：早期多用 logistic regression（简单、快、可解释）；现代系统常用 deep learning，如 DLRM（Deep Learning Recommendation Model）：categorical features 用 embedding tables，dense features 用 MLP。
Calibration：预测概率必须可校准（若模型说 P(click)=0.05，实际应接近 5% 点击），因为该概率会直接影响出价。
Exploration-exploitation：总是展示当前最优 ad 会伤害长期收益（无法发现新 ad 潜力）。可用 Thompson sampling 或 $\epsilon$-greedy，让一部分曝光给低确定性 ad 收集数据。

Real-Time Bidding¶

用户打开页面后，<100ms 内完成 ad auction：
1. Publisher 向多个 ad exchanges 发送 bid request（含 user 信息与页面上下文）。
2. 各 advertiser 的 bidding server 预测其 ad 的 CTR。
3. Bid = CTR × value_per_click，出价高者获胜。
4. 展示 winning ad；若发生点击，advertiser 付费。

4. Fraud Detection¶

Problem Framing¶

Goal：实时检测欺诈交易（credit card fraud、account takeover、fake reviews）。
Latency：<100ms（付款链路中必须先放行或拦截）。
核心挑战：极端 class imbalance（fraud rate 约 0.1%）。false positives 会误伤正常用户，false negatives 会造成直接损失。

Architecture¶

Fraud detection pipeline: transaction → real-time feature pipeline → ML model → decision engine → allow/review/block, with human review feeding labels back for retraining

Features¶

Transaction features：amount、currency、merchant category、time of day、is_international。
User features：account age、平均交易金额、近期交易次数、device fingerprint。
Velocity features（实时来自 streaming pipeline）：最近 5 分钟交易数、最近 1 小时商户去重数、与上一笔交易的地理距离。
Graph features：该 merchant 是否与已知 fraud ring 相连？该 device 是否与被标记账户共享？

Model¶

Gradient-boosted trees（XGBoost、LightGBM）是 tabular fraud detection 的常用标准：能处理混合特征类型、可解释（feature importance）且训练快。
处理不平衡：可对多数类 undersampling、少数类 oversampling（SMOTE），或在 loss function 使用 class weights。Focal loss（chapter 8）可下调 easy negatives 权重。
Cost matrix：false positive（拦截正常交易）和 false negative（漏报欺诈）的成本不同。阈值应最小化总期望成本，而不是单纯最大化 accuracy。

Human-in-the-Loop¶

不确定样本（model confidence 在 0.3-0.7）交给人工 reviewer。review 结果再作为 retraining 标签，形成反馈闭环，使 model 随 labeled fraud 样本增加而持续提升。

5. Content Moderation¶

Problem Framing¶

Goal：自动识别并移除平台有害内容（hate speech、violence、misinformation、CSAM）。
Scale：每天十亿级 posts（text、images、video）。
挑战：强上下文依赖（sarcasm、satire、cultural nuance），需平衡表达自由与平台安全。

Architecture¶

Multi-modal classification：text、image、video 分别建模，再通过 fusion layer 融合信号。
Text moderation：fine-tuned language model 把文本分到 harassment、hate speech、misinformation、spam 等类别；multilingual models 覆盖 100+ 语言。
Image moderation：vision model 检测 explicit content（nudity、violence）、图中文字（OCR + text classifier）和已知有害内容（与已知 CSAM 库做 hash-matching）。
Video moderation：按时间间隔抽帧，对每帧做 image classifier，并结合音频转写（ASR → text classifier）。
Policy-as-code：将 moderation policy 写成结构化规则，把 model 输出映射到具体动作：

if text_model.hate_speech_score > 0.9:
    action = "remove"
elif text_model.hate_speech_score > 0.7:
    action = "human_review"
else:
    action = "allow"

policy 变化频繁（法规更新、社会规范变化）。将 policy 与 model 解耦，可在不 retraining 的情况下快速上线规则调整。

Proactive vs Reactive Moderation¶

Proactive（发布前）：内容上线前先过 classifiers，高置信违规直接拦截。优点是有害内容不落地，缺点是增加发布 latency 且有 false positives 风险。
Reactive（发布后）：先上线，用户举报后再触发 classifier + human review。优点是发布低延迟，缺点是有害内容在被发现前可见。
多数平台混合使用：高危类别走 proactive（如 CSAM，零容忍，发布前阻断）；语义复杂类别走 reactive（如 misinformation，需人工判断，举报后复核）。

Hash Matching¶

对已知有害内容（CSAM、恐怖宣传）使用 perceptual hashing：生成对轻微修改（裁剪、缩放、压缩）鲁棒的 hash，与已知有害库（NCMEC hash database、GIFCT shared hash database）比对，命中则可直接下架，无需 classifier。
PhotoDNA（Microsoft）是 CSAM 检测中的标准 perceptual hash，在许多司法辖区是法律义务，不只是技术选择。

Back-of-Envelope Numbers¶

Scale：1B posts/day ≈ 每秒 12K posts。单帖可能需要 text classification（~5ms）、image classification（~20ms）、hash matching（~1ms）。在 12K QPS 下，约需 60 个 text classifiers、240 个 image classifiers、12 个 hash matchers（外加冗余）。
Human review：若 2% 内容进人工审核，即 20M/day。按每人每天 100 条，需 200K reviewers（这就是自动化准确率的价值：false positives 每降 0.1%，可减少约 1M/day 审核量）。
Latency budget：proactive moderation 需在发布链路内完成（~500ms）。text（5ms）+ image（20ms）+ hash（1ms）+ overhead 通常可满足；video 是例外，10 分钟视频即便按 1fps 抽帧也要 600 次分类，宜异步处理。

Escalation Workflow¶

自动移除 → 申诉人工复核 → 专家复核（法务、文化专家）→ policy team 处理模糊争议。层级越高，案例量越小但判断更细。
反馈给 model：人工审核结论是高质量标签来源。model 与 reviewer 分歧样本应优先进入 active learning，因为它们恰是 model 当前最薄弱的区域。

6. Conversational AI（RAG-Based Chatbot）¶

Problem Framing¶

Goal：基于公司文档构建 chatbot，回答产品相关问题。
Requirements：答案准确（不 hallucinate）、可引用来源、能处理 follow-up questions、并保持在产品领域内。

RAG architecture: embed query, search vector DB for relevant chunks, rerank, feed to LLM with original query to generate grounded response

Architecture：Retrieval-Augmented Generation（RAG）¶

User query → Query Embedding → Vector Search (documentation) → Top-K chunks
                                                                      ↓
User query + Retrieved chunks → LLM → Response (with citations)

Components¶

Document ingestion：先对文档 chunking 再 embedding。Chunking strategy 对效果影响很大：
- Fixed-size chunking：每 N tokens（如 500）切一段，并做 M-token overlap（如 50）。简单且长度稳定，但可能切断句子/段落，丢失语义。
- Semantic chunking：按段落或章节边界切，每块语义更完整；缺点是长度不一（有的 100 tokens、有的 800），要求 retrieval system 适配可变长度。
- Recursive chunking：优先按段落切，段落太长再按句子切，句子仍太长再固定长度切；在语义完整性与长度一致性之间更平衡。
- Embedding：用 text encoder（如 E5、BGE、Cohere embed）将 chunk 向量化并存入 vector database。
Retrieval：将 user query embedding 后在 vector database 搜索最相似的 $k$ 个 chunks（常用 $k = 5$-$10$），可选 cross-encoder rerank 提升精度。
Generation：把 retrieved chunks 作为 context 组装 prompt：

System: You are a helpful assistant. Answer based ONLY on the provided context.
If the answer is not in the context, say "I don't know."

Context:
[chunk 1]
[chunk 2]
...

User: {question}

Guardrails：防止 LLM 回答域外问题、生成有害内容或与 context 冲突。可通过 input filtering（拒绝 off-topic query）、output filtering（校验回答与 context 一致性）、constitutional prompting（明确拒答规则）实现。
Conversation memory：保留最近 $n$ 轮对话并加入 prompt，使 model 能理解追问（如 "What about the pricing?" 需要上一轮具体产品上下文）。

Query Rewriting¶

用户常提出歧义追问，如 "What about the pricing?"（哪个产品的 pricing？）。Query rewriting 结合对话历史生成可独立检索的问题：
- Input：conversation history + "What about the pricing?"
- Rewritten："What is the pricing for the enterprise plan of Product X?"
只有 rewritten query 才应送入 embedding 与 vector search；否则只搜 "pricing" 往往返回无关 chunks。
Query rewriting 可用一次小型 LLM 调用（~50ms），也可用 fine-tuned sequence-to-sequence model（~5ms）。

Back-of-Envelope Numbers¶

Documentation corpus：10K pages，每页平均 2000 tokens，共 20M tokens。按 500 tokens/chunk、50 overlap，约 44K chunks。
Embedding index：44K × 768-dim × float16 ≈ 65 MB，内存压力很小；即便 10M chunks 也约 15 GB。
Latency breakdown：query embedding（5ms）+ vector search（2ms）+ cross-encoder rerank（top-50 约 20ms）+ LLM generation（500-2000ms）≈ 600-2100ms。瓶颈在 LLM，可用 streaming 降低感知延迟。
Cost：按 $3/1M tokens（Claude/GPT-4 API）估算，1000 queries/day、每次约 2K tokens，约 $6/day；若到 1M queries/day，可自托管 7B model（2 张 A10G，约 $50/day），成本可降约 100x。

Evaluation¶

Retrieval quality：Recall@K（top-K 是否包含答案）、MRR（Mean Reciprocal Rank）。
Generation quality：factual accuracy（回答是否与 context 一致）、groundedness（引用是否对应正确 chunks）、answer relevance。
End-to-end：user satisfaction（thumbs up/down）、升级到人工客服的比例。

7. Large-Scale Image Search¶

Problem Framing¶

Goal：给定一张图，在 1B+ 图像语料中找到视觉相似图片。
Applications：reverse image search、商品图搜（photo → matching products）、重复图检测。
Latency：含网络往返在内 <500ms。

Architecture¶

Query image → Embedding Model (ViT/CLIP) → 512-dim vector → ANN Search → Top-K results

Embedding Extraction¶

Model：预训练 vision encoder（ViT、CLIP image encoder、DINOv2），必要时在具体领域（fashion、e-commerce、medical imaging）fine-tune。
Training：contrastive learning（chapter 10）。positive pairs 是同图不同视角（或 image + matching text），negative pairs 是随机图像。目标是让相似图 embedding 更近、不同图更远。

Indexing¶

Offline：先对 1B 图像全量 embedding，再构建 ANN index。以 HNSW（file 03）为例，建索引通常需数小时，索引驻内存约 128 GB（1B × 512-dim × float16 + 图结构开销）。
Sharding：将 index 拆到多机，每台持有一个 shard。查询时并行搜所有 shards，再 merge top-K。
Incremental updates：新增图像（上传、新商品）需持续入索引。HNSW 支持增量插入无需全量重建；Milvus、Pinecone 等 vector database 原生支持该流程。

Serving¶

Embedding service：GPU 服务运行 ViT，单图约 20ms，可批处理提升吞吐。
Search service：ANN index 服务，在 1B 向量上做 top-100 检索约 10ms（HNSW）。
Caching：对热门查询做结果缓存。做 duplicate detection 时，可先缓存近期上传图 embedding，优先与缓存比对，再决定是否搜索全量 index。

Evaluation¶

Precision@K：top-K 结果是否真的相似。
Recall@K：语料中所有真实相似图，有多少进入 top-K。
Mean Average Precision（mAP）：precision-recall 曲线下面积。
Human evaluation：针对主观相似性，使用人工标注判断检索结果相关性。

面试作答框架¶

遇到 systems design 题时，可按以下流程：
Clarify requirements（2-3 分钟）：确认 scale、latency、一致性要求、边界情况。例："用户量多大？可接受 latency 是多少？故障时怎么处理？"
High-level design（5-7 分钟）：画出主要组件和交互，从 happy path 开始，优先复用 01-03 文件中的通用模式。
Deep dive（15-20 分钟）：挑最关键或最难组件深入设计，展示深度。ML system 常见 deep dive 点：model architecture、feature pipeline、serving architecture。
Evaluation and monitoring（3-5 分钟）：怎么衡量成功？可能出什么问题？如何检测与响应？
Iteration（2-3 分钟）：如果有更多时间/资源，会优先改进什么？体现你对 tradeoff 与优先级的理解。
面试官关注点：是否结构化思考（不盲目跳方案）、是否理解 tradeoff（每个选择都有成本）、是否具备实践感（真的做过系统）、能否清晰沟通设计。