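Git and version control

Git is how software teams collaborate without overwriting each other's work. This file covers the Git mental model, branching strategies, merge and rebase, conflict resolution, pull requests, and ML-specific challenges such as large files and experiment tracking.

Every serious software project uses version control. Git is the dominant system, used by nearly all open-source projects and companies. Without git, collaboration means emailing zip files and praying that nobody overwrites your changes. With git, every change is tracked, reversible, and attributable.
For ML engineers: git tracks your code, configs, and experiment scripts. Combined with experiment tracking tools, it gives you reproducibility: "exactly which code and config produced this model?"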

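Mental model

Git tracks snapshots of your project. Each commit is a complete snapshot of all tracked files at that moment, not a diff. (Internally, git packs objects using delta compression for efficiency, but conceptually every commit is a full state.)
Files live in four "places":
1. Working directory: the actual files on disk. You edit these.
2. Staging area (index): the changes you have marked for the next commit. git add moves changes here.
3. Local repository: your commit history, stored in .git/. git commit saves the staging area as a new snapshot.
4. Remote repository (e.g., GitHub): the shared copy. git push uploads your commits, git pull downloads everyone else's.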

Working Dir  →  git add  →  Staging  →  git commit  →  Local Repo  →  git push  →  Remote
                                                        ←  git pull  ←

The staging area is what makes git powerful. You can edit 10 files but commit only 3 of them, leaving the other changes for a separate commit. This keeps commits clean and focused.
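To make the four-place model concrete, here is a minimal sketch of a toy repository in Python. It is not how git is implemented: the `ToyRepo` class and its fields are hypothetical, and real commit ids also depend on the parent commit, author, and timestamp. It only illustrates that a commit stores the staged state, not everything you happen to have edited.

```python
import hashlib
import json


class ToyRepo:
    """A deliberately simplified model of the working dir, index, and commits."""

    def __init__(self):
        self.working = {}   # path -> content: the files you edit
        self.index = {}     # path -> content: what the next commit will contain
        self.commits = []   # list of (commit_id, snapshot), oldest first

    def add(self, path):
        # Like `git add path`: copy the current content into the staging area.
        self.index[path] = self.working[path]

    def commit(self, message):
        # Like `git commit`: store the whole index as a new snapshot.
        snapshot = dict(self.index)
        payload = json.dumps([message, snapshot], sort_keys=True).encode()
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits.append((commit_id, snapshot))
        return commit_id


repo = ToyRepo()
repo.working = {"train.py": "v1", "config.yaml": "lr: 0.001", "notes.md": "draft"}
repo.add("train.py")
repo.add("config.yaml")
repo.commit("Add training script and config")

# notes.md was edited but never staged, so it is not part of the snapshot.
latest_snapshot = repo.commits[-1][1]
print(sorted(latest_snapshot))
```

The unstaged notes.md stays in the working directory and can go into a later, separate commit.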

Basic commands

git init                          # create a new repository
git clone url                     # download a remote repository
git status                        # what has changed? (most-used command)
git add file.py                   # stage a specific file
git add .                         # stage all changes (use with caution)
git commit -m "descriptive msg"   # commit staged changes
git push                          # upload commits to remote
git pull                          # download + merge remote changes
git log --oneline                 # compact commit history
git diff                          # show unstaged changes
git diff --staged                 # show staged changes

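Branching

A branch is a pointer to a commit. The default branch is main (or master). Creating a branch gives you an independent line of development: you can make changes without affecting main.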

git branch feature-x              # create a branch
git checkout feature-x            # switch to it
git checkout -b feature-x         # create and switch in one step
git branch -d feature-x           # delete branch (after merging)
git branch -a                     # list all branches (local + remote)

When to branch: always. Never commit directly to main. Every feature, bug fix, or experiment gets its own branch. This keeps main stable and deployable.
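The "branch is a pointer" idea can be sketched with two plain dictionaries. This is a toy model under simplifying assumptions: the commit ids (`a1`, `b2`, `c3`) and messages are made up, and each commit has a single parent.

```python
# Commits form a chain through parent links; a branch is just a name -> commit id.
commits = {
    "a1": {"parent": None, "msg": "Initial commit"},
    "b2": {"parent": "a1", "msg": "Add data loader"},
}
branches = {"main": "b2"}

# `git branch feature-x`: a new name pointing at the same commit. Nothing is copied.
branches["feature-x"] = branches["main"]

# Committing on feature-x adds a commit and moves only that pointer.
commits["c3"] = {"parent": branches["feature-x"], "msg": "Try larger batch size"}
branches["feature-x"] = "c3"


def history(branch):
    """Walk parent links from the branch tip back to the root commit."""
    result, commit_id = [], branches[branch]
    while commit_id is not None:
        result.append(commit_id)
        commit_id = commits[commit_id]["parent"]
    return result


print(history("main"))       # main did not move
print(history("feature-x"))  # feature-x is one commit ahead
```

Because creating a branch only adds a name, branching is cheap, which is part of why "always branch" is a practical rule.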

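Branching strategies

Feature branches (most common): every feature or fix gets a branch off main. When it is done, open a pull request (PR) to merge it back. Simple, and works for most teams.
Trunk-based development: developers commit to main frequently (several times a day) and hide incomplete work behind feature flags. Preferred by teams that deploy continuously (Google, Facebook). Requires excellent CI/CD.
Gitflow: separate branches for features, releases, and hotfixes. More complex, and better suited to software with versioned releases (mobile apps, packaged software). Overkill for most ML projects.
For ML teams: short-lived feature branches (merged within 1-3 days) are the sweet spot. Long-lived branches diverge from main and produce painful merge conflicts.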

Merge and rebase

Merge creates a new "merge commit" that combines two branches:

git checkout main
git merge feature-x

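This preserves the full history: you can see the work that happened on the branch and when it was merged. The merge commit has two parents. (If main has no new commits since the branch was created, git fast-forwards by default and creates no merge commit; git merge --no-ff forces one.)
Rebase replays your branch's commits on top of the target branch: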

git checkout feature-x
git rebase main

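This rewrites history: your branch's commits get new hashes, as if you had started working from the current tip of main. The result is a linear history (no merge commits) that reads more cleanly.
When to use which:
- Rebase to update your feature branch with the latest changes from main (keeps your branch clean and current).
- Merge to integrate your feature branch into main (preserves the branch history).
- Never rebase commits that you have pushed and shared with others. Rebase rewrites history; if someone else has built work on top of the original commits, rebasing causes chaos.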
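The difference between the two operations can be illustrated with a toy commit graph. This is a sketch, not git's object format: the `make_commit` helper is hypothetical and derives an id from only the message and parent ids, whereas real git also hashes the tree, author, and dates. The point it shows still holds: a merge adds one commit with two parents, while a rebase creates new commits with new ids.

```python
import hashlib

commits = {}


def make_commit(msg, parents):
    """Create a toy commit whose id depends on its message and its parents."""
    payload = (msg + "|" + ",".join(parents)).encode()
    commit_id = hashlib.sha1(payload).hexdigest()
    commits[commit_id] = {"msg": msg, "parents": list(parents)}
    return commit_id


root = make_commit("Initial commit", [])
main_tip = make_commit("Update README", [root])        # new work on main
feat_1 = make_commit("Add augmentation", [root])       # feature-x, commit 1
feat_2 = make_commit("Tune augmentation", [feat_1])    # feature-x, commit 2

# Merge: one new commit with two parents; feat_1 and feat_2 keep their ids.
merge_commit = make_commit("Merge feature-x into main", [main_tip, feat_2])

# Rebase: replay each feature commit on top of main's tip.
# Same messages, different parents, therefore different ids.
rebased_1 = make_commit("Add augmentation", [main_tip])
rebased_2 = make_commit("Tune augmentation", [rebased_1])

print(len(commits[merge_commit]["parents"]))
print(rebased_1 != feat_1, rebased_2 != feat_2)
```

The changed ids are why rebasing shared commits is dangerous: a teammate whose work sits on `feat_2` is now building on a commit that is no longer part of your branch.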

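Resolving conflicts

A conflict occurs when two branches modify the same lines of the same file. Git cannot decide automatically which change to keep and asks you to resolve it manually.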

<<<<<<< HEAD
learning_rate = 0.001
=======
learning_rate = 0.0005
>>>>>>> feature-x

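Between <<<<<<< HEAD and ======= is the current branch's version. Between ======= and >>>>>>> feature-x is the incoming branch's version. You decide which to keep (or combine them), delete the markers, save, and git add the resolved file.
Pitfall: do not leave conflict markers in committed files. They are literal text and will break your code. Always search for <<<<<<< after resolving.
Reducing conflicts: keep branches short-lived, merge main into your branch often, and avoid having several people edit the same file at the same time.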
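Searching for leftover markers is easy to automate. The following is a minimal sketch of such a check; the function name `find_conflict_markers` is hypothetical, and the pattern assumes the default marker length of seven characters. A file that legitimately contains a line of exactly seven `=` characters would be flagged as a false positive.

```python
import re

# A conflict marker starts a line: seven '<' or '>' followed by a label,
# or a line consisting of exactly seven '='.
MARKER = re.compile(r"^(<{7} |={7}$|>{7} )")


def find_conflict_markers(text):
    """Return 1-based line numbers that look like leftover conflict markers."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKER.match(line)
    ]


sample = """\
batch_size = 32
<<<<<<< HEAD
learning_rate = 0.001
=======
learning_rate = 0.0005
>>>>>>> feature-x
epochs = 10
"""

print(find_conflict_markers(sample))                     # [2, 4, 6]
print(find_conflict_markers("learning_rate = 0.001\n"))  # []
```

A check like this could run in a pre-commit hook or in CI. Git itself offers a similar safeguard: git diff --check warns when changes introduce conflict markers.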

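Writing good commit messages

Commit messages are written for your future self and your teammates. "Fix bug" tells you nothing. "Fix off-by-one error in batch size calculation causing OOM on 8-GPU training" tells you everything.
Format: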

Short summary (50 chars or less, imperative mood)

Longer description if needed. Explain WHY, not WHAT
(the diff shows what changed). Wrap at 72 characters.

Fixes #123

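Imperative mood: "Add feature", not "Added feature" or "Adds feature". Read it as completing the sentence: "If applied, this commit will add feature."
Atomic commits: each commit should do one thing. "Add data loader" is one commit. "Add data loader and fix unrelated bug and update README" should be three commits. This is what makes git bisect (finding which commit introduced a bug) practical.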
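The mechanical parts of the format (subject length, blank line, body wrap) can be checked automatically. Here is a minimal sketch; `lint_commit_message` is a hypothetical helper, not part of git, and it does not try to detect imperative mood or atomicity, which need human judgment.

```python
def lint_commit_message(message, subject_limit=50, body_limit=72):
    """Return a list of format problems in a commit message (empty if none)."""
    problems = []
    lines = message.rstrip("\n").split("\n")
    subject = lines[0]
    if len(subject) > subject_limit:
        problems.append(f"subject is {len(subject)} chars (limit {subject_limit})")
    if len(lines) > 1 and lines[1] != "":
        problems.append("missing blank line after subject")
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_limit:
            problems.append(f"line {number} is {len(line)} chars (limit {body_limit})")
    return problems


good = """\
Fix off-by-one in batch size calculation

The last partial batch was counted twice, which caused OOM
on 8-GPU training runs.

Fixes #123
"""
bad = "fixed stuff and also changed a bunch of other things in the loader"

print(lint_commit_message(good))  # no problems reported
print(lint_commit_message(bad))   # subject too long
```

A function like this could be called from a commit-msg hook, which receives the path of the message file as its first argument.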

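Pull requests and code review

A pull request (PR) proposes merging a branch into main. It is the gateway for code review: teammates read your changes, suggest improvements, and approve before the merge.
Good PR practices:
- Keep PRs small (fewer than 400 changed lines). Large PRs get rubber-stamped because nobody wants to review 2000 lines.
- Write a clear description: what changed, why, and how to test it.
- Link to the issue or ticket that prompted the change.
- Respond to review comments promptly.
- Squash trivial commits before merging (so main has a clean history).
Code review is not primarily about catching bugs (that is what tests are for). It is about knowledge sharing (reviewers learn the codebase), design feedback (is this the right approach?), and maintaining standards (naming, style, architecture).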
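The size guideline above can be checked before opening a PR. The sketch below parses text in the format printed by git diff --numstat (added count, deleted count, and path, separated by tabs; binary files show "-" for both counts). The sample text and file paths are made up for illustration, and `count_changed_lines` is a hypothetical helper.

```python
def count_changed_lines(numstat_output):
    """Sum added + deleted lines from `git diff --numstat` style text."""
    total = 0
    for line in numstat_output.splitlines():
        if not line.strip():
            continue
        added, deleted, _path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            continue  # binary files are reported as "-" and have no line counts
        total += int(added) + int(deleted)
    return total


# Hypothetical output of: git diff --numstat main...feature-x
sample = (
    "120\t15\tsrc/loader.py\n"
    "30\t2\ttests/test_loader.py\n"
    "-\t-\tdocs/diagram.png\n"
)

changed = count_changed_lines(sample)
print(changed, "lines changed;", "OK" if changed < 400 else "consider splitting")
```

In practice the input would come from running the git command with `subprocess.run(..., capture_output=True, text=True)`. The 400-line threshold is the guideline from the list above, not a hard rule.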

.gitignore

The .gitignore file tells git which files to exclude from tracking. For ML projects:

# Python
__pycache__/
*.pyc
*.egg-info/
.venv/
env/

# Data and models (too large for git)
data/
*.csv
*.parquet
models/
*.pt
*.onnx
*.bin
checkpoints/

# Secrets
.env
*.pem
credentials.json

# IDE
.vscode/
.idea/
*.swp

# OS
.DS_Store
Thumbs.db

# Jupyter
.ipynb_checkpoints/

# Experiment outputs
wandb/
mlruns/
outputs/
logs/

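Pitfall: adding a file to .gitignore after it has been committed does not remove it from the repository. You must also run git rm --cached file to untrack it. Even then, the file stays in history forever unless you rewrite history (which is messy).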
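To build intuition for how the patterns above match paths, here is a rough approximation in Python. It is intentionally incomplete: it handles only plain names, `*` globs, and trailing-slash directory patterns, and ignores negation (`!`), anchored patterns, and `**`. The `is_ignored` helper and the sample paths are hypothetical. For an authoritative answer on a real repository, git check-ignore -v path reports which pattern, if any, matches.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A small subset of the patterns listed above.
PATTERNS = ["__pycache__/", "*.pyc", "data/", "*.pt", ".env", "checkpoints/"]


def is_ignored(path, patterns=PATTERNS):
    """Approximate gitignore matching for simple patterns only."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches if any parent directory has that name.
            if any(fnmatchcase(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif fnmatchcase(parts[-1], pattern):
            # Pattern without a slash: matches the file name at any depth.
            return True
    return False


for candidate in ["data/train.csv", "models/best.pt", ".env", "src/train.py"]:
    print(candidate, "->", "ignored" if is_ignored(candidate) else "tracked")
```

Note that this only predicts what git would skip for untracked files. As the pitfall above says, a file that is already committed stays tracked regardless of what .gitignore contains.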

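Git for machine learning

Machine learning brings challenges that traditional software does not face:
Large files: datasets and model weights are gigabytes or more. Git was designed for text files (source code), not binary blobs. Solutions:
- Git LFS (Large File Storage): tracks pointers in git and stores the actual files on a separate server. Simple, but GitHub imposes storage and bandwidth limits.
- DVC (Data Version Control): manages data and model files separately from git, using remote storage (S3, GCS). Handles data the way git handles code: dvc add data.csv, dvc push, dvc pull.
Experiment tracking: which commit + which hyperparameters + which data produced which metrics? Git tracks code, but not the full experiment context.
- Weights & Biases (W&B): logs metrics, hyperparameters, system info, and a link to the git commit. Provides dashboards for comparing runs.
- MLflow: open-source experiment tracking with a model registry. Logs parameters, metrics, and artifacts.
- Simple approach: record the git hash in your training script: git_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip(). Store it alongside your results.
Reproducibility checklist (what to track for every experiment):
- Git commit hash (exact code version)
- Config file / hyperparameters
- Random seed
- Python and library versions (pip freeze)
- Data version (DVC hash or dataset version tag)
- Hardware (GPU type, number of GPUs)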

# Quick reproducibility snapshot
echo "Commit: $(git rev-parse HEAD)" > experiment_info.txt
echo "Branch: $(git branch --show-current)" >> experiment_info.txt
echo "Dirty: $(git status --porcelain | wc -l) files" >> experiment_info.txt
pip freeze >> experiment_info.txt
nvidia-smi >> experiment_info.txt
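The same snapshot can be taken from inside a training script, which makes it easy to save next to the metrics. The sketch below covers only part of the checklist (commit, dirty state, config, seed, Python version, platform); library versions, data version, and GPU details would need to be added separately. The helper names, the config values, and the seed are hypothetical, and the git fields fall back to None when git is not installed or the script runs outside a repository.

```python
import json
import platform
import subprocess


def git_output(*args):
    """Run a git command and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or not inside a repository
    return result.stdout.strip()


def experiment_info(config, seed):
    """Collect the checklist items that can be read from the current process."""
    status = git_output("status", "--porcelain")
    return {
        "commit": git_output("rev-parse", "HEAD"),
        "dirty": bool(status) if status is not None else None,
        "config": config,
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


# Hypothetical hyperparameters and seed, for illustration only.
info = experiment_info(config={"learning_rate": 0.001, "batch_size": 32}, seed=42)
print(json.dumps(info, indent=2))
```

Recording the dirty flag matters: a commit hash alone is misleading if the run used uncommitted changes.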