Linux and the Command Line

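The command line is the primary interface for ML engineering: training jobs, server administration, data pipelines, and cluster management all happen through a terminal. This file covers the shell, file system, permissions, process management, package managers, environment variables, SSH, and the essential commands every ML engineer uses daily.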

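GUIs are convenient for browsing the web. They are a poor fit for running a training job on a remote GPU cluster at 2 AM. The command line (also called the terminal or shell) is the tool that scales: it runs on any machine, it is scriptable and composable, and it works the same way on your laptop, a cloud VM, and an HPC cluster.
If you are an ML engineer who only uses Jupyter notebooks and VS Code buttons, you are leaving a great deal of productivity on the table. Nearly every production ML system is deployed, monitored, and debugged through the command line.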

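Shell

A shell is a program that reads commands from you and executes them. It is the intermediary between you and the operating system (Chapter 13). The most common shells are bash (the default on most Linux systems) and zsh (the default on macOS).
Commands have the form: command [options] [arguments]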

ls -la /home/user    # command=ls, options=-la, argument=/home/user

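Options modify behavior (short forms are usually prefixed with -, long forms with --). ls -l lists in long format, and ls --all shows hidden files. Many short options can be combined: ls -la means -l and -a together.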

Basic Navigation

pwd                 # print working directory (where am I?)
ls                  # list files in current directory
ls -la              # list all files (including hidden) with details
cd /path/to/dir     # change directory
cd ..               # go up one level
cd ~                # go to home directory
cd -                # go back to previous directory

File Operations

cp source dest      # copy file
cp -r dir1 dir2     # copy directory recursively
mv old new          # move/rename file
rm file             # delete file (no recycle bin — gone forever)
rm -rf dir          # delete directory recursively (DANGEROUS — no confirmation)
mkdir -p a/b/c      # create nested directories
touch file.txt      # create empty file (or update timestamp)
cat file.txt        # print file contents
head -n 20 file     # first 20 lines
tail -f logfile     # follow a log file in real-time (invaluable for monitoring training)

Pitfall: rm -rf is one of the most dangerous commands you will type. There is no undo. Triple-check the path before pressing Enter. Never run rm -rf / or rm -rf ~.
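In scripts, the path often comes from a variable, and an empty variable is how rm -rf ends up pointed at the wrong place. The sketch below shows one common guard. The function name clean_cache and the demo directory are hypothetical, and this is an illustration rather than a complete safety net.

```shell
#!/bin/sh
# Hypothetical helper: remove the (non-hidden) contents of a directory.
clean_cache() {
    dir="$1"
    # ${dir:?message} aborts with an error when dir is unset or empty,
    # so the command can never expand to "rm -rf /*".
    # "--" ends option parsing, so a path starting with "-" is not read as a flag.
    rm -rf -- "${dir:?cache directory not set}"/*
}

# Demo in a throwaway directory
demo=$(mktemp -d)
touch "$demo/a.tmp" "$demo/b.tmp"
clean_cache "$demo"
remaining=$(ls -A "$demo" | wc -l)
echo "files left: $remaining"
rmdir "$demo"
```

Calling clean_cache with an empty argument stops with the error message instead of deleting anything.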

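Pipes and Redirection

The shell's killer feature is composability: small commands chained together can accomplish complex things.
Pipe (|): sends the output of one command to the next command as input.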

cat training.log | grep "loss" | tail -5    # last 5 lines containing "loss"
ps aux | grep python                        # find running Python processes
history | grep "docker"                     # find previous docker commands

Redirection: sends output to a file instead of the screen.

python train.py > output.log 2>&1    # stdout AND stderr to file
python train.py >> output.log        # append (don't overwrite)
echo "data" > file.txt               # overwrite file
echo "more" >> file.txt              # append to file

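2>&1 redirects stderr (file descriptor 2) to wherever stdout (file descriptor 1) currently points. Without it, error messages still appear on the screen and only normal output goes to the file. Order matters: 2>&1 must come after the > redirection.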
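The two streams are easy to tell apart with a small function that writes to both. This is a minimal sketch. The function emit and the file names are hypothetical stand-ins for a real program and its logs.

```shell
#!/bin/sh
# Hypothetical stand-in for a program that writes to both streams.
emit() {
    echo "normal output"          # goes to stdout (fd 1)
    echo "error message" >&2      # goes to stderr (fd 2)
}

tmp=$(mktemp -d)

emit > "$tmp/out_only.log" 2> "$tmp/err_only.log"   # one file per stream
emit > "$tmp/both.log" 2>&1                          # both streams in one file

# Redirections are processed left to right. Here stderr is pointed at the
# current stdout first, and only afterwards is stdout sent to the file,
# so the error message never reaches the file.
leaked=$(emit 2>&1 > "$tmp/wrong_order.log")

echo "both.log has $(wc -l < "$tmp/both.log") lines"
echo "escaped the file: $leaked"
```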

Text Processing

grep "error" logfile.txt             # find lines containing "error"
grep -r "import torch" src/          # search recursively in directory
grep -i "warning" log.txt            # case-insensitive search
grep -c "epoch" train.log            # count matching lines

wc -l file.txt                       # count lines
wc -w file.txt                       # count words

sort data.txt                        # sort lines alphabetically
sort -n numbers.txt                  # sort numerically
sort -u data.txt                     # sort and remove duplicates
uniq -c sorted.txt                   # count consecutive duplicates

cut -d',' -f2,3 data.csv            # extract columns 2 and 3 from CSV
awk '{print $1, $3}' data.txt       # print 1st and 3rd whitespace-separated fields
sed 's/old/new/g' file.txt          # replace all occurrences of "old" with "new"

These compose well:

# Find the 10 most common error types in a log file
grep "ERROR" app.log | awk -F': ' '{print $2}' | sort | uniq -c | sort -rn | head -10
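To see what each stage contributes, it can help to run the same pipeline on a tiny log. The sketch below assumes a hypothetical log format of "time LEVEL: type: detail". Real logs will differ, and the awk field number would need adjusting.

```shell
#!/bin/sh
# Hypothetical log in the form "<time> <LEVEL>: <type>: <detail>".
log=$(mktemp)
top="$log.top"
cat > "$log" <<'EOF'
10:00:01 INFO: epoch 1 started
10:00:05 ERROR: CUDA out of memory: tried to allocate 2 GiB
10:00:09 ERROR: FileNotFoundError: data/train.csv
10:00:12 ERROR: CUDA out of memory: tried to allocate 4 GiB
10:00:15 WARNING: learning rate is very high
10:00:20 ERROR: CUDA out of memory: tried to allocate 2 GiB
EOF

grep "ERROR" "$log" |            # keep only the error lines
    awk -F': ' '{print $2}' |    # second field = error type
    sort |                       # put identical types next to each other
    uniq -c |                    # collapse each run and prefix its count
    sort -rn |                   # highest count first
    head -10 > "$top"            # keep at most 10 types

cat "$top"
```

The sort before uniq -c is what makes the counts correct, because uniq only merges adjacent identical lines.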

Finding Files

find . -name "*.py"                  # find all Python files
find . -name "*.pyc" -delete         # find and delete compiled Python files
find /data -size +100M               # files larger than 100 MB
find . -mtime -1                     # files modified in the last 24 hours

which python                        # where is the python executable?
locate filename                      # fast file search (uses pre-built index)
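Because find -delete is irreversible, a reasonable habit is to run the same expression with -print first and delete only once the list looks right. The sketch below uses a throwaway directory, and all file names in it are hypothetical.

```shell
#!/bin/sh
# Throwaway project tree for the demo (all names are hypothetical).
proj=$(mktemp -d)
mkdir -p "$proj/src/__pycache__"
touch "$proj/src/train.py" "$proj/src/__pycache__/train.cpython-311.pyc"

# Step 1: preview. -print lists the matches without changing anything.
find "$proj" -name "*.pyc" -print

# Step 2: delete only after the preview looks right.
find "$proj" -name "*.pyc" -delete

# -type f restricts matches to regular files; -exec ... {} + runs the
# command on the matches in batches (here: byte counts of the .py files).
find "$proj" -type f -name "*.py" -exec wc -c {} +
```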

File System Hierarchy

Linux organizes everything in a single tree rooted at /:

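Directory	Purpose
`/`	Root of the entire file system
`/home/user`	Your personal files, configuration, projects
`/etc`	System-wide configuration files
`/usr`	User programs, libraries, documentation
`/usr/local`	Locally installed software (not from the package manager)
`/var`	Variable data: logs (`/var/log`), databases, caches
`/tmp`	Temporary files (typically cleared on reboot)
`/opt`	Optional third-party software
`/proc`	Virtual file system exposing kernel and process information
`/dev`	Device files (disks and GPUs appear here)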

For ML: your training data typically lives in /data or /home/user/data, models in /home/user/models, and CUDA in /usr/local/cuda. GPU devices appear as /dev/nvidia0, /dev/nvidia1, and so on.

File Permissions

Every file and directory has three permission types for each of three user classes:

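Permission	File	Directory
r (read)	View contents	List contents
w (write)	Modify contents	Create/delete files inside
x (execute)	Run as a program	Enter (cd into) the directory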

The three user classes: owner (u), group (g), others (o).

ls -l script.py
# -rwxr-xr-- 1 henry ml_team 2048 Mar 28 script.py
#  ^^^         owner permissions: rwx (read, write, execute)
#     ^^^      group permissions: r-x (read, execute, no write)
#        ^^^   others permissions: r-- (read only)

chmod 755 script.py       # owner=rwx, group=rx, others=rx
chmod +x script.py        # add execute permission for everyone
chmod u+w,g-w file.txt    # add write for owner, remove write for group
chown henry:ml_team file  # change owner and group
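The numbers given to chmod are octal digits, one per user class, where each digit is the sum of r=4, w=2, and x=1. The sketch below spells out that arithmetic. The helper name triplet_to_digit is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a permission triplet such as "r-x"
# into its octal digit (r=4, w=2, x=1).
triplet_to_digit() {
    t="$1"
    d=0
    case "$t" in r??) d=$((d + 4)) ;; esac
    case "$t" in ?w?) d=$((d + 2)) ;; esac
    case "$t" in ??x) d=$((d + 1)) ;; esac
    echo "$d"
}

# The -rwxr-xr-- mode from the ls -l listing above:
mode="$(triplet_to_digit rwx)$(triplet_to_digit r-x)$(triplet_to_digit r--)"
echo "$mode"    # prints 754
```

By the same arithmetic, chmod 755 means rwx for the owner and r-x for both group and others.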

Pitfall: a Python script with #!/usr/bin/env python3 at the top needs execute permission (chmod +x) to run as ./script.py. Without it, you must use python3 script.py.
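The same rule applies to any script with a shebang line. The sketch below uses a shell script rather than a Python one so that it needs nothing beyond sh. The file name hello.sh is hypothetical, and the demo assumes the temporary directory allows execution.

```shell
#!/bin/sh
work=$(mktemp -d)
script="$work/hello.sh"      # hypothetical script name

# Write a two-line script whose first line is the shebang.
printf '%s\n' '#!/bin/sh' 'echo "hello from script"' > "$script"
chmod 644 "$script"          # rw-r--r--: readable, but not executable

# Running it directly fails because the execute bit is missing...
if "$script" 2>/dev/null; then direct_before=ok; else direct_before=denied; fi
# ...but handing it to the interpreter works, since sh only needs to read it.
via_interpreter=$(sh "$script")

chmod u+x "$script"          # add execute for the owner (now rwxr--r--)
direct_after=$("$script")

echo "before chmod: $direct_before"
echo "via interpreter: $via_interpreter"
echo "after chmod: $direct_after"
```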

Process Management

A process is a running program (Chapter 13). The shell gives you tools to manage them:

ps aux                    # list all running processes
ps aux | grep python      # find Python processes
top                       # real-time process monitor (CPU, memory)
htop                      # better version of top (install separately)
nvidia-smi                # GPU usage (essential for ML)
watch -n 1 nvidia-smi     # refresh nvidia-smi every second

kill PID                  # gracefully terminate process
kill -9 PID               # force kill (use when graceful fails)
killall python            # kill all Python processes

# Run in background
python train.py &                    # run in background
nohup python train.py > log.txt &    # run in background, survive logout
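When you start a job from a script, you usually want its PID so you can check on it or stop it later. The sketch below is a minimal illustration in which sleep 300 stands in for a long-running training command.

```shell
#!/bin/sh
# "sleep 300" stands in for a long-running training command.
sleep 300 &
pid=$!                       # $! holds the PID of the most recent background job
echo "started job with PID $pid"

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$pid" 2>/dev/null; then state_before=running; else state_before=gone; fi

kill "$pid"                  # default signal is SIGTERM (a polite request to stop)
status=0
wait "$pid" || status=$?     # wait until it has exited and collect its exit status

if kill -0 "$pid" 2>/dev/null; then state_after=running; else state_after=gone; fi
echo "before: $state_before, after: $state_after, exit status: $status"
```

An exit status above 128 means the process was ended by a signal: the status is 128 plus the signal number, and SIGTERM is signal 15.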

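nohup is essential for ML training: without it, closing the SSH connection typically kills the training job. nohup makes the process ignore the hangup signal (SIGHUP) that is sent when the terminal closes.
screen and tmux are terminal multiplexers that create persistent sessions. You can start a training job in a tmux session, disconnect from SSH, reconnect later, and the session (and the training) is still running.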

tmux new -s training          # create named session
# ... start training ...
# Ctrl+B, then D              # detach from session
tmux attach -t training       # reattach later (even after SSH reconnect)
tmux ls                       # list sessions

Package Managers

System packages (OS-level software):

# Debian/Ubuntu
sudo apt update               # refresh package list
sudo apt install htop         # install a package
sudo apt upgrade              # upgrade all packages

# macOS
brew install wget             # install via Homebrew

Python packages:

pip install torch             # install from PyPI
pip install -e .              # install current project in editable mode
pip install -r requirements.txt  # install from requirements file
pip freeze > requirements.txt    # export installed packages

# Conda (for complex dependencies like CUDA)
conda create -n myenv python=3.11
conda activate myenv
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

Pitfall: never pip install into the system Python. Always use a virtual environment (python -m venv env, conda create, or uv venv). The system Python is shared with OS tools; breaking it can break your system.
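One way to enforce this habit is a small guard that refuses to install unless an environment is active. The sketch below relies on the VIRTUAL_ENV variable set by venv-style activation scripts and the CONDA_PREFIX variable set by conda activate. The function names in_virtualenv and safe_pip_install are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when a virtual environment is active.
# venv-style activate scripts set VIRTUAL_ENV; conda activate sets CONDA_PREFIX.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ] || [ -n "${CONDA_PREFIX:-}" ]
}

# Hypothetical wrapper that refuses to install outside an environment.
safe_pip_install() {
    if in_virtualenv; then
        python -m pip install "$@"
    else
        echo "refusing to install: no virtual environment is active" >&2
        return 1
    fi
}
```

With this in your shell startup file, safe_pip_install torch behaves like pip install torch inside an environment and stops with an error outside one. pip also has a built-in check: setting PIP_REQUIRE_VIRTUALENV=true makes pip itself refuse to run outside a virtual environment.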

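Environment Variables

Environment variables are key-value pairs that a process passes on to the programs it launches. They configure behavior without changing code.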

export CUDA_VISIBLE_DEVICES=0,1    # use only GPUs 0 and 1
export PYTHONPATH=/home/user/src   # add to Python's import path
export WANDB_API_KEY=abc123        # API key for Weights & Biases

echo $PATH                         # see current PATH
export PATH=$PATH:/usr/local/cuda/bin  # add CUDA to PATH

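.bashrc (or .zshrc): commands in this file run every time you open an interactive shell. Put your export statements here so they persist across sessions.
.env files: project-specific variables loaded by tools like python-dotenv. Keep secrets (API keys, database passwords) in .env and add .env to .gitignore. Never commit secrets to git.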
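The difference between a plain shell variable and an exported one only shows up in child processes. The sketch below demonstrates it, along with one common way to load a simple KEY=value .env file from the shell. The variable names and file contents are hypothetical, and the loading trick assumes the file is trusted and uses plain shell syntax.

```shell
#!/bin/sh
# Exported variables are copied into child processes; plain shell variables are not.
LOCAL_ONLY="stays in this shell"          # hypothetical names throughout
export SHARED="visible to children"

child_view=$(sh -c 'echo "LOCAL_ONLY=[${LOCAL_ONLY:-}] SHARED=[${SHARED:-}]"')
echo "$child_view"

# One-off setting: prefix the command, and only that command sees the value.
one_off=$(CUDA_VISIBLE_DEVICES=0 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
echo "one-off value seen by the child: $one_off"

# Loading a simple KEY=value .env file: "set -a" auto-exports every
# variable assigned until "set +a" turns that behavior off again.
envdir=$(mktemp -d)
printf 'WANDB_PROJECT=demo\nBATCH_SIZE=32\n' > "$envdir/.env"
set -a
. "$envdir/.env"
set +a
loaded=$(sh -c 'echo "$WANDB_PROJECT $BATCH_SIZE"')
echo "$loaded"
```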

SSH (Secure Shell)

SSH connects you to a remote machine over an encrypted channel. This is how you access cloud VMs, GPU servers, and HPC clusters.

ssh user@hostname              # connect to remote machine
ssh -i ~/.ssh/key.pem user@ip  # connect with specific key
ssh -L 8888:localhost:8888 user@server  # port forwarding (Jupyter on remote)

SSH keys (a public/private key pair) replace passwords:

ssh-keygen -t ed25519          # generate key pair
ssh-copy-id user@server        # copy public key to server
# now you can SSH without typing a password

The SSH config file (~/.ssh/config) stores connection details:

Host gpu-server
    HostName 10.0.1.42
    User henry
    IdentityFile ~/.ssh/gpu_key
    LocalForward 8888 localhost:8888

Now ssh gpu-server connects with all of these settings automatically.
scp and rsync transfer files between machines:

scp model.pt user@server:/data/models/     # copy file to remote
scp -r user@server:/data/results/ ./       # copy directory from remote
rsync -avz --progress data/ user@server:/data/  # sync with progress (smarter than scp)

Essential ML Commands Cheat Sheet

# GPU monitoring
nvidia-smi                                   # GPU usage snapshot
watch -n 1 nvidia-smi                        # live monitoring
gpustat                                      # cleaner GPU overview (pip install gpustat)

# Training management
nohup python train.py > train.log 2>&1 &     # background training that survives logout
tail -f train.log                            # monitor training output
kill %1                                      # kill background job number 1 (see: jobs)

# Disk usage (datasets are huge)
df -h                                        # disk space on all mounts
du -sh /data/*                               # size of each item in /data
du -h --max-depth=1 .                        # size of subdirectories

# Memory
free -h                                      # RAM usage
cat /proc/meminfo                            # detailed memory info

# Network
curl -O https://example.com/dataset.tar.gz   # download file
wget https://example.com/model.bin           # alternative downloader
curl -X POST http://localhost:8080/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "hello"}'                   # test a model serving endpoint

# Archives
tar -czf archive.tar.gz directory/           # compress
tar -xzf archive.tar.gz                      # extract
zip -r archive.zip directory/                # zip
unzip archive.zip                            # unzip

# Quick data inspection
head -5 data.csv                             # first 5 lines of CSV
wc -l data.csv                               # count rows
cut -d',' -f1 data.csv | sort -u | wc -l    # count unique values in column 1
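One detail to watch in the data inspection commands: wc -l counts the header row too. The sketch below skips the header before counting, using a small hypothetical CSV. It assumes no field contains a quoted comma, which cut cannot handle.

```shell
#!/bin/sh
# Hypothetical CSV with a header row.
csv=$(mktemp)
cat > "$csv" <<'EOF'
label,split,path
cat,train,img/001.jpg
dog,train,img/002.jpg
cat,val,img/003.jpg
cat,train,img/004.jpg
EOF

total_lines=$(wc -l < "$csv")                 # counts the header as well
data_rows=$(tail -n +2 "$csv" | wc -l)        # tail -n +2 starts at line 2

# Rows per label, most frequent first (class balance at a glance).
tail -n +2 "$csv" | cut -d',' -f1 | sort | uniq -c | sort -rn

echo "lines: $total_lines, data rows: $data_rows"
```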