Linux cheat sheet for GPU pods
~40 commands you actually use after SSH-ing into a pod — files / permissions / archives / SCP / nvidia-smi GPU triage / killing processes
Alaya NeW GPU instances run Ubuntu 22.04 LTS by default. This page is the small set of commands you reach for after SSH-ing into a pod — not a Linux primer, just the ones you use every day.
Companion reads: Tmux for long-running jobs · GitHub access acceleration (China) · HF mirror acceleration
1. Files & directories
Change directory (cd)
cd / # root
cd ~ # home
cd - # last visited directory (used a lot)
cd ../ # parent
cd /data/checkpoints # any absolute path
Hit Tab to autocomplete paths. Dataset paths can run many levels deep, so autocompletion should become muscle memory.
List contents (ls)
ls # current directory
ls -a # include hidden files (.cache / .gitignore etc.)
ls -lh # detailed list with human-readable sizes (120M / 4.0G)
ls -lhS # sort by size descending — find disk hogs
ls -lt # sort by mtime, newest first (find the latest checkpoint)
Create / remove / move
mkdir -p /data/runs/exp-001/logs # -p creates intermediate directories
touch a.txt # create empty file
rm a.txt
rm -rf <dir> # recursive force-delete, no confirmation
mv old.pt new.pt # rename or move
cp -r src/ dst/ # recursive copy (common for checkpoints)
rm -rf has no recycle bin. Before running rm -rf $TMP/* inside a pod, always echo $TMP first: if the variable is empty, the command expands to rm -rf /* and wipes the filesystem.
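One way to make the empty-variable case harmless is to wrap the delete in a small guard. The sketch below is only an illustration: safe_clean is a hypothetical helper name, not a standard command, and it only covers the obvious cases (empty argument, the root directory, a path that is not a directory).

```shell
# safe_clean DIR: delete the contents of DIR, refusing obviously dangerous input.
# safe_clean is a hypothetical helper name, not a standard command.
safe_clean() {
    local dir="${1:-}"
    # Refuse an empty argument or the filesystem root.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "safe_clean: refusing to clean '$dir'" >&2
        return 1
    fi
    # Refuse anything that is not an existing directory.
    if [ ! -d "$dir" ]; then
        echo "safe_clean: '$dir' is not a directory" >&2
        return 1
    fi
    # ${dir:?} makes the shell abort the expansion if dir is ever empty here.
    # The * glob skips hidden files, the same as rm -rf $TMP/* does.
    rm -rf -- "${dir:?}"/*
}

# Usage (not run here): safe_clean "$TMP"
```

The ${VAR:?} expansion is useful on its own: rm -rf "${TMP:?}"/* stops with an error instead of expanding to /* when TMP is unset or empty.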
Find files
find . -name "*.safetensors" # find all model weights
find . -type f -size +1G # find files larger than 1G
find . -type f -mtime -1 # files modified in last 24h
find /data -name "checkpoint-*" -type d # find training checkpoint dirs
Print path
pwd # where am I?
realpath ./model.bin # resolve to absolute path (mount-point debug)
2. Read files
cat config.json # dump entire file
less train.log # paginated view, q to quit, /keyword to search
head -50 train.log # first 50 lines
tail -50 train.log # last 50 lines
tail -f train.log # follow appended writes (watch training loss)
tail -f train.log is one of the most-used commands during training. Pair it with Tmux so an SSH dropout doesn't kill your view.
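Training logs are usually noisy, so it often helps to combine grep with tail and look only at the lines you care about. A minimal sketch follows; recent_matches is a hypothetical helper, and the step=... loss=... log format in the usage note is made up for illustration.

```shell
# recent_matches FILE PATTERN [COUNT]: print the last COUNT lines (default 5)
# of FILE that contain PATTERN. Hypothetical helper name.
recent_matches() {
    local file="$1" pattern="$2" count="${3:-5}"
    # grep keeps matching lines; tail keeps only the most recent ones.
    grep -- "$pattern" "$file" | tail -n "$count"
}

# Usage (not run here): recent_matches train.log 'loss=' 10
# Live variant with GNU grep: tail -f train.log | grep --line-buffered 'loss='
```

--line-buffered makes grep flush each match immediately, which matters when its input is a pipe that never ends.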
3. vim — three things are enough
vim takes a whole book to learn. Inside a pod, three things are enough:
vim config.yaml
After it opens:
| Goal | How |
|---|---|
| Edit | press i to enter insert mode, then Esc to leave it |
| Save & quit | Esc → type :wq → Enter |
| Quit without saving | Esc → type :q! → Enter |
Plus 4 common moves:
| Goal | How |
|---|---|
| Jump to line 100 | :100 Enter |
| Search for loss | /loss Enter, then n for next |
| Delete current line | press dd |
| Undo | press u |
If vim isn't your thing — just use VS Code Remote-SSH and edit pod files in your local editor.
4. Permissions (chmod)
chmod +x run.sh # add executable bit (most common)
chmod 755 run.sh # numeric form: sets the mode to exactly rwxr-xr-x (+x only adds the execute bits)
chmod -R 755 /data/scripts # recursive
Octal mnemonic: r=4, w=2, x=1. Sum them for each digit.
| Number | Meaning |
|---|---|
| 7 (4+2+1) | read + write + execute (rwx) |
| 6 (4+2) | read + write (rw-) |
| 5 (4+1) | read + execute (r-x) |
| 4 | read only (r--) |
The three digits in chmod 755 file map to owner / group / others.
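To check the mnemonic mechanically, the sketch below converts a nine-character permission string (the part that ls -l prints after the file-type character) into its octal form. perm_to_octal is a hypothetical helper written for this page; it ignores special bits such as setuid and sticky.

```shell
# perm_to_octal PERMS: convert e.g. rwxr-xr-x to 755 using r=4, w=2, x=1.
# Hypothetical helper; special bits (s, t) are not handled.
perm_to_octal() {
    local perms="$1" result="" triple digit
    # Expect exactly nine characters: owner, group, others.
    [ "${#perms}" -eq 9 ] || return 1
    while [ -n "$perms" ]; do
        triple=$(printf '%s' "$perms" | cut -c1-3)   # next rwx group
        perms=$(printf '%s' "$perms" | cut -c4-)     # remainder
        digit=0
        case "$triple" in r??) digit=$((digit + 4)) ;; esac
        case "$triple" in ?w?) digit=$((digit + 2)) ;; esac
        case "$triple" in ??x) digit=$((digit + 1)) ;; esac
        result="$result$digit"
    done
    printf '%s\n' "$result"
}

perm_to_octal rwxr-xr-x   # prints 755
perm_to_octal rw-r--r--   # prints 644
```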
ls -l shows permissions:
-rwxr-xr-x 1 user user 4096 May 3 12:00 run.sh
|└┬┘└┬┘└┬┘
| |  |  └── others: r-x
| |  └───── group: r-x
| └──────── owner: rwx
└────────── - means file, d means directory
5. Archives
# tar (Linux native)
tar -czvf data.tar.gz data/ # pack + gzip
tar -xzvf data.tar.gz # extract here
tar -xzvf data.tar.gz -C /target/ # extract to specified directory
tar -tf data.tar.gz | head # list contents without extracting
# zip (cross-platform, useful when sharing weights with Windows users)
zip -r weights.zip checkpoint-*/
unzip weights.zip
unzip -l weights.zip # list contents
Flag mnemonic: c=create / x=extract / t=list / z=gzip / v=verbose / f=file.
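The flags are easiest to remember after one full round trip. The sketch below packs a directory, lists the archive, and extracts it somewhere else, all inside a throwaway temp directory; every path and file name in it is a placeholder.

```shell
# Round trip in a scratch directory: create -> list -> extract elsewhere.
work=$(mktemp -d)
mkdir -p "$work/data" "$work/restore"
printf 'hello\n' > "$work/data/sample.txt"

# c=create, z=gzip, f=archive file. -C switches directory before packing,
# so the archive stores the relative path data/... instead of the temp path.
tar -czf "$work/data.tar.gz" -C "$work" data

# t=list: inspect the archive without extracting it.
listing=$(tar -tzf "$work/data.tar.gz")

# x=extract, with -C choosing the target directory.
tar -xzf "$work/data.tar.gz" -C "$work/restore"
restored=$(cat "$work/restore/data/sample.txt")

printf '%s\n' "$listing"
rm -rf "$work"
```

Packing with -C is a design choice worth copying: archives built from relative paths extract cleanly under whatever directory you point -C at.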
6. File transfer (local ↔ pod)
Alaya NeW pods are reached via SSH. Suppose your connection info is ssh -p 31029 user@example.alayanew.com.
Local → pod
# single file
scp -P 31029 model.safetensors user@example.alayanew.com:/data/
# directory
scp -rP 31029 dataset/ user@example.alayanew.com:/data/
Pod → local
scp -P 31029 user@example.alayanew.com:/data/checkpoint-1000.pt ./
scp -rP 31029 user@example.alayanew.com:/data/runs/exp-001/ ./
⚠️ scp is slow and cannot resume an interrupted transfer. For anything over 1 GB use rsync -avzP, which resumes partial transfers and only sends what changed.
rsync -avzP -e "ssh -p 31029" /data/dataset/ user@example.alayanew.com:/data/dataset/
Download from the public internet
wget https://example.com/file.tar.gz # simple download
wget -c https://example.com/file.tar.gz # resume (-c)
wget -O model.bin https://... # save with custom name
curl -LO https://example.com/file.tar.gz # curl equivalent
GitHub releases via accelerator (China only): see GitHub access acceleration. HuggingFace weights via mirror: see HF mirror acceleration.
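Large downloads over a flaky link usually need more than one attempt, and wget -c makes each new attempt continue where the last one stopped. A small retry wrapper around it might look like the sketch below; retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Hypothetical helper name.
retry() {
    local attempts="$1" delay="$2" n=1
    shift 2
    while :; do
        # Success: stop immediately.
        "$@" && return 0
        # Out of attempts: report failure to the caller.
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (not run here; needs network access):
#   retry 5 10 wget -c https://example.com/file.tar.gz
```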
7. Processes & resources
System resources
free -h # memory (-h human-readable)
df -h # disk usage per mount point
df -h /data # specific mount
du -sh /data/* # size of each subdirectory under /data
top # live process list, q to quit, sorted by CPU
htop # nicer top, may need: apt install htop
Find and kill processes
ps -ef | grep python # list all python processes
ps aux --sort=-%mem | head -n 11 # top 10 by memory (the first line is the header)
kill <pid> # graceful
kill -9 <pid> # force kill (when graceful is ignored)
pkill -f train.py # match by full command line
8. GPU triage ⭐️
The high-frequency GPU cases — check usage, find the squatter process, free the card.
How to read nvidia-smi
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
+-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id | ... | Memory-Usage | GPU-Util | Pwr:Usage/Cap |
| 0 NVIDIA H100 80GB HBM3 | ... | | 78234MiB/81559MiB | 98% | 412W/700W |
+-------------------------------+----------------------+----------------------+
| Processes: |
| GPU GI CI PID Process name GPU Memory |
| 0 N/A N/A 1234 python train.py 78GiB |
+-----------------------------------------------------------------------------+
| Field | Meaning |
|---|---|
| Memory-Usage | used / total VRAM. Near the cap means OOM risk |
| GPU-Util | compute utilization. Sustained < 30% usually points to an I/O bottleneck or a batch size that is too small |
| Pwr:Usage/Cap | current / max power draw. Sustained < 50% suggests the card isn't fully loaded |
| Processes.PID | the PID holding the card, needed for kill |
Live refresh
nvidia-smi -l 1 # built-in 1-second refresh
watch -n 1 nvidia-smi # equivalent
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1 # only specific columns
Find squatter and free the card
# 1. List PIDs holding GPUs
nvidia-smi
# 2. Suppose PID 1234 is a leftover training run
ps -fp 1234 # confirm it's yours (check the UID and CMD columns)
kill 1234 # ask it to exit; escalate to kill -9 1234 only if it ignores SIGTERM
# 3. One-liner: kill all GPU users in this pod (use with care — kills running jobs too)
fuser -k /dev/nvidia*
fuser -k /dev/nvidia* kills every process holding a GPU device inside this pod. Use with care on shared pods.
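On a shared pod, a gentler alternative is to narrow the PID list to processes you own before killing anything. The sketch below assumes a Linux /proc filesystem, where /proc/<pid> is normally owned by the user running that process; own_pids is a hypothetical helper, and the PIDs are expected on stdin, one per line.

```shell
# own_pids: read PIDs from stdin and print only those owned by the current
# user. Hypothetical helper; relies on Linux /proc.
own_pids() {
    local pid
    while read -r pid; do
        # Skip blank or non-numeric lines.
        case "$pid" in ''|*[!0-9]*) continue ;; esac
        # -O is true when the path exists and belongs to the current user.
        if [ -O "/proc/$pid" ]; then
            printf '%s\n' "$pid"
        fi
    done
}

# Usage on a pod (not run here):
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader | own_pids
```

Review the resulting list before passing it to kill; the filter only checks ownership, not whether the job is still wanted.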
Why is my training slow? (nvidia-smi reading guide)
| Symptom | Likely cause |
|---|---|
| GPU-Util 0%, memory full | Process hung / waiting on data I/O |
| GPU-Util flickers 0% ↔ 100% | DataLoader is slow: increase num_workers, set pin_memory=True |
| GPU-Util sustained < 30% | Batch size too small / model is just small |
| OOM near memory cap | Reduce batch size / gradient checkpointing / switch to ZeRO-3 |
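Part of the table above can be turned into a rough first-pass check on a single nvidia-smi sample. The sketch below reads lines in the format produced by --query-gpu=index,memory.used,memory.total,utilization.gpu with --format=csv,noheader,nounits. classify_gpu is a hypothetical helper, and the 90% and 95% memory thresholds are assumptions chosen for illustration; only the 30% utilization figure comes from the table. One sample cannot detect flickering, which needs repeated readings.

```shell
# classify_gpu: read "index, mem_used, mem_total, util" lines on stdin and
# print a one-line hint per GPU. Hypothetical helper; thresholds are assumed.
classify_gpu() {
    awk -F', *' '
    {
        idx = $1; used = $2; total = $3; util = $4
        mem_pct = (total > 0) ? 100 * used / total : 0
        if (util == 0 && mem_pct >= 90)
            hint = "idle with memory full: possibly hung or waiting on data I/O"
        else if (mem_pct >= 95)
            hint = "near memory cap: OOM risk"
        else if (util < 30)
            hint = "low utilization: batch size may be too small"
        else
            hint = "looks busy"
        printf "GPU %s: util=%d%% mem=%d%% -> %s\n", idx, util, mem_pct, hint
    }'
}

# Usage on a pod (not run here):
#   nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
#              --format=csv,noheader,nounits | classify_gpu
```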
9. The "I can't remember what I did" lifesavers
history | tail -50 # last 50 commands (find that training command)
history | grep python # history filtered by python
!1234 # rerun history entry 1234
!! # rerun last command (e.g. sudo !!)
ctrl + r # reverse-search history (build muscle memory)
ctrl + c # interrupt current command
ctrl + z # suspend the foreground job (resume with fg, or bg to continue it in the background)
which python # which python is on PATH (venv debug)
echo $PATH # where the shell looks for executables
env | grep CUDA # CUDA-related env vars
10. Long jobs detach from SSH (use Tmux)
GPU training runs hours to days. SSH dropping = job dies — always start long jobs inside Tmux:
tmux new -s train # new session
# … run training inside …
# Press ctrl+b then d to detach (job keeps running)
tmux attach -t train # reattach
tmux ls # list sessions
What's next
- Done with the cheat sheet → Tmux multi-session training
- Edit pod files in a real IDE → VS Code Remote-SSH
- Distributed training → Ray distributed PyTorch
