Linux cheat sheet for GPU pods

~40 commands you actually use after SSH-ing into a pod — files / permissions / archives / SCP / nvidia-smi GPU triage / killing processes

Alaya NeW GPU instances run Ubuntu 22.04 LTS by default. This page is the small set of commands you reach for after SSH-ing into a pod — not a Linux primer, just the ones you use every day.

Companion reads: Tmux for long-running jobs · GitHub access acceleration (China) · HF mirror acceleration

1. Files & directories

Change directory (`cd`)

cd /                # root
cd ~                # home
cd -                # last visited directory (used a lot)
cd ../              # parent
cd /data/checkpoints  # any absolute path

Hit Tab to autocomplete paths. Dataset paths can run dozens of levels deep, so make autocompletion muscle memory.

List contents (`ls`)

ls                  # current directory
ls -a               # include hidden files (.cache / .gitignore etc.)
ls -lh              # detailed list with human-readable sizes (120M / 4.0G)
ls -lhS             # sort by size descending — find disk hogs
ls -lt              # sort by mtime descending — find latest checkpoint

Create / remove / move

mkdir -p /data/runs/exp-001/logs    # -p creates intermediate directories
touch a.txt                          # create empty file
rm a.txt
rm -rf <dir>                         # recursive force-delete, no confirmation
mv old.pt new.pt                     # rename or move
cp -r src/ dst/                      # recursive copy (common for checkpoints)

rm -rf has no recycle bin. Before running rm -rf $TMP/* inside a pod, always echo $TMP first: if the variable is empty, the command expands to rm -rf /* and starts deleting from the filesystem root.
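One way to make that check automatic is a small guard function. The sketch below is a hypothetical helper (clean_dir is not a standard command) that refuses to run when its argument is empty or not a directory. It only defines the function; nothing is deleted until you call it.

```shell
# clean_dir: hypothetical helper, not a standard command.
# Empties a directory, but refuses when the path is empty or not a directory.
clean_dir() {
    dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "clean_dir: refusing, '$dir' is empty or not a directory" >&2
        return 1
    fi
    # ${dir:?} is a second guard: the expansion errors out if dir is empty,
    # so this can never turn into "rm -rf /*". The glob skips dotfiles.
    rm -rf -- "${dir:?}"/*
}
```

Usage: clean_dir "$TMP". With an empty $TMP it prints a refusal and returns non-zero instead of touching anything.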

Find files

find . -name "*.safetensors"          # find all model weights
find . -type f -size +1G              # find files larger than 1G
find . -type f -mtime -1              # files modified in last 24h
find /data -name "checkpoint-*" -type d   # find training checkpoint dirs
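A common follow-up is "which checkpoint is the newest?". The sketch below combines the mtime sort from ls -lt with the checkpoint-* naming used above. latest_checkpoint is a hypothetical helper, and it assumes checkpoints are directories sitting directly under the run directory.

```shell
# latest_checkpoint: hypothetical helper; assumes checkpoints are directories
# named checkpoint-* directly under the given run directory.
latest_checkpoint() {
    run_dir="${1:-.}"
    # -t sorts by mtime (newest first), -d lists the directories themselves
    # rather than their contents; head keeps only the newest one.
    ls -td -- "$run_dir"/checkpoint-*/ 2>/dev/null | head -n 1
}
```

Usage: latest_checkpoint /data/runs/exp-001. The result is the most recently modified directory, which is not necessarily the one with the highest step number.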

Print path

pwd                                   # where am I?
realpath ./model.bin                  # resolve to absolute path (mount-point debug)

2. Read files

cat config.json                       # dump entire file
less train.log                        # paginated view, q to quit, /keyword to search
head -50 train.log                    # first 50 lines
tail -50 train.log                    # last 50 lines
tail -f train.log                     # follow appended writes (watch training loss)

tail -f train.log is one of the most-used commands during training — pair it with Tmux so SSH dropouts don't kill your view.

3. vim — three things are enough

vim takes a whole book to learn fully. Inside a pod, you only need three things:

vim config.yaml

After it opens:

Goal	How
Edit	press `i` to enter insert mode, then `Esc` to leave it
Save & quit	`Esc` → type `:wq` → Enter
Quit without saving	`Esc` → type `:q!` → Enter

Plus 4 common moves:

Goal	How
Jump to line 100	`:100` Enter
Search for `loss`	`/loss` Enter, then `n` for next
Delete current line	press `dd`
Undo	press `u`

If vim isn't your thing — just use VS Code Remote-SSH and edit pod files in your local editor.

4. Permissions (`chmod`)

chmod +x run.sh                      # add the executable bit, keep the other bits (most common)
chmod 755 run.sh                     # set the exact mode rwxr-xr-x (numeric form)
chmod -R 755 /data/scripts           # recursive (also marks every regular file executable)

Octal mnemonic: r=4, w=2, x=1 — sum them.

Number	Meaning
`7` (4+2+1)	read + write + execute (rwx)
`6` (4+2)	read + write (rw-)
`5` (4+1)	read + execute (r-x)
`4`	read only (r--)

The three digits in chmod 755 file map to owner / group / others.

ls -l shows permissions:

-rwxr-xr-x  1 user user 4096 May  3 12:00 run.sh
│└┬┘└┬┘└┬┘
│ │  │  └── others: r-x
│ │  └───── group:  r-x
│ └──────── owner:  rwx
└────────── - means file, d means directory
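To check the octal arithmetic on a real file, the sketch below sets a mode and reads it back. show_mode is a hypothetical helper; it relies on GNU stat (the -c flag), which is what Ubuntu ships.

```shell
# show_mode: hypothetical helper printing a file's octal mode and ls-style string.
# Relies on GNU stat (-c), as found on Ubuntu.
show_mode() {
    stat -c '%a %A' -- "$1"
}

f=$(mktemp)                      # throwaway file
chmod 640 "$f"; show_mode "$f"   # → 640 -rw-r-----  (owner 4+2, group 4, others 0)
chmod u+x "$f"; show_mode "$f"   # → 740 -rwxr-----  (symbolic form: add x for owner only)
rm -f "$f"
```

The second chmod shows the difference between the two forms: symbolic modes such as u+x change one bit and leave the rest alone, while numeric modes replace all nine bits at once.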

5. Archives

# tar (Linux native)
tar -czvf data.tar.gz data/          # pack + gzip
tar -xzvf data.tar.gz                # extract here
tar -xzvf data.tar.gz -C /target/    # extract to specified directory
tar -tf  data.tar.gz | head          # list contents without extracting

# zip (cross-platform, useful when sharing weights with Windows users)
zip -r weights.zip checkpoint-*/
unzip weights.zip
unzip -l weights.zip                 # list contents

Flag mnemonic: c=create / x=extract / t=list / z=gzip / v=verbose / f=file. Put f last in the bundle, because the archive name must follow it directly.
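A quick round trip in a scratch directory shows how the flags fit together, including -C. This is only a sketch: pack_dir is a hypothetical helper, and the paths are throwaway temp directories.

```shell
# pack_dir: hypothetical helper. Packs directory $1 into archive $2, storing
# paths relative to the directory's parent instead of the full absolute path.
pack_dir() {
    src="${1%/}"                                  # drop a trailing slash
    tar -czf "$2" -C "$(dirname -- "$src")" "$(basename -- "$src")"
}

work=$(mktemp -d)
mkdir -p "$work/data" "$work/out"
echo "hello" > "$work/data/a.txt"

pack_dir "$work/data" "$work/data.tar.gz"         # c + z + f
tar -tzf "$work/data.tar.gz"                      # t: list without extracting
tar -xzf "$work/data.tar.gz" -C "$work/out"       # x: extract into out/
cat "$work/out/data/a.txt"                        # → hello
rm -rf "$work"
```

Packing with -C keeps the archive free of absolute paths, so it extracts cleanly into whatever directory the recipient chooses.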

6. File transfer (local ↔ pod)

Alaya NeW pods are reached via SSH. Suppose your connection info is ssh -p 31029 user@example.alayanew.com.

Local → pod

# single file
scp -P 31029 model.safetensors user@example.alayanew.com:/data/

# directory
scp -rP 31029 dataset/ user@example.alayanew.com:/data/

Pod → local

scp -P 31029 user@example.alayanew.com:/data/checkpoint-1000.pt ./
scp -rP 31029 user@example.alayanew.com:/data/runs/exp-001/ ./

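⚠️ scp cannot resume an interrupted transfer, so a dropped connection on a large file means starting over. For anything over 1 GB use rsync -avzP: it keeps partial files so a rerun resumes (-P), and it transfers only what changed.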

rsync -avzP -e "ssh -p 31029" /data/dataset/ user@example.alayanew.com:/data/dataset/

Download from the public internet

wget https://example.com/file.tar.gz             # simple download
wget -c https://example.com/file.tar.gz          # resume (-c)
wget -O model.bin https://...                    # save with custom name

curl -LO https://example.com/file.tar.gz         # curl equivalent

GitHub releases via accelerator (China only): see GitHub access acceleration. HuggingFace weights via mirror: see HF mirror acceleration.

7. Processes & resources

System resources

free -h               # memory (-h human-readable)
df -h                 # disk usage per mount point
df -h /data           # specific mount
du -sh /data/*        # size of each subdirectory under /data
top                   # live process list, q to quit, sorted by CPU
htop                  # nicer top, may need: apt install htop
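When df -h shows a mount filling up, the usual next step is to rank directories by size. The sketch below pipes du into sort for that. biggest is a hypothetical helper, and sort -h (human-readable sizes) assumes GNU coreutils.

```shell
# biggest: hypothetical helper. Prints the N largest entries directly under a
# directory (default: current directory, top 10), largest first.
biggest() {
    # du -sh: one human-readable total per entry
    # sort -rh: reverse order, understands K/M/G suffixes (GNU sort)
    du -sh -- "${1:-.}"/* 2>/dev/null | sort -rh | head -n "${2:-10}"
}
```

Usage: biggest /data 5. On large dataset trees du has to walk every file, so expect it to take a while.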

Find and kill processes

ps -ef | grep python                 # list all python processes
ps aux --sort=-%mem | head -n 11     # header line + top 10 by memory
kill <pid>                           # graceful
kill -9 <pid>                        # force kill (when graceful is ignored)
pkill -f train.py                    # match by command line

8. GPU triage ⭐️

The high-frequency GPU cases — check usage, find the squatter process, free the card.

How to read `nvidia-smi`

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
+-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        | ...  | Memory-Usage  | GPU-Util | Pwr:Usage/Cap |
|   0  NVIDIA H800A 80GB HBM3    | ...           |      | 78234MiB/81559MiB | 98%  | 412W/700W     |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                   |
|  GPU   GI   CI  PID  Process name                            GPU Memory      |
|    0   N/A  N/A 1234  python train.py                        78GiB           |
+-----------------------------------------------------------------------------+

Field	Meaning
`Memory-Usage`	used / total VRAM. Near the cap = OOM risk
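`GPU-Util`	share of the sampling window in which at least one kernel was running (not how full the card is). Sustained < 30% usually means the job is I/O-bound or the batch size is too small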
`Pwr:Usage/Cap`	current / max power. Sustained < 50% means the card isn't fully loaded
`Processes.PID`	the PID holding the card — needed for `kill`

Live refresh

nvidia-smi -l 1                      # built-in 1-second refresh
watch -n 1 nvidia-smi                # same data, but redraws in place instead of appending
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1   # only specific columns
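The CSV form is easy to post-process. As a sketch, the hypothetical helper below reads "index, used, total" lines and prints only the GPUs whose VRAM usage is at or above a threshold. It assumes the input comes from --format=csv,noheader,nounits, so values are plain MiB numbers.

```shell
# flag_full_gpus: hypothetical helper. Reads "index, used_MiB, total_MiB" lines
# on stdin and prints GPUs whose VRAM usage is >= the given percentage (default 90).
flag_full_gpus() {
    awk -F', *' -v t="${1:-90}" '
        ($3 + 0) > 0 {                      # skip rows without a numeric total
            pct = 100 * $2 / $3
            if (pct >= t) printf "GPU %s: %d%% of VRAM used\n", $1, pct
        }'
}

# On a pod:
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#            --format=csv,noheader,nounits | flag_full_gpus 90
```

Reading from stdin keeps the helper testable without a GPU: printf '0, 78234, 81559\n' | flag_full_gpus 90 prints "GPU 0: 95% of VRAM used".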

Find squatter and free the card

# 1. List PIDs holding GPUs
nvidia-smi

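# 2. Suppose PID 1234 is a leftover training run
ps -fp 1234                          # confirm it's yours (exact PID match, unlike grep)
kill 1234                            # ask it to exit (SIGTERM)
kill -9 1234                         # force kill only if it is still there a few seconds later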

# 3. One-liner: kill all GPU users in this pod (use with care — kills running jobs too)
fuser -k /dev/nvidia*

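fuser -k /dev/nvidia* sends SIGKILL to every process in this pod that has a GPU device file open, including healthy running jobs. Run fuser -v /dev/nvidia* first to see what would be killed, and avoid it on shared pods.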
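For a single known PID, a gentler pattern is "ask first, force later". The sketch below is a hypothetical helper (stop_pid is not a standard command) that sends SIGTERM, waits up to a timeout, and only then falls back to SIGKILL.

```shell
# stop_pid: hypothetical helper. Usage: stop_pid <pid> [timeout_seconds]
# Sends SIGTERM, polls once per second, and escalates to SIGKILL on timeout.
stop_pid() {
    pid="$1"
    timeout="${2:-10}"
    kill "$pid" 2>/dev/null || { echo "stop_pid: cannot signal $pid" >&2; return 1; }
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # kill -0 only tests existence
        sleep 1
        i=$((i + 1))
    done
    kill -9 "$pid" 2>/dev/null || true           # still there: force kill
}
```

Usage: stop_pid 1234 15. Giving the process time to handle SIGTERM lets a training script flush logs or finish writing a checkpoint, which kill -9 never allows.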

Why is my training slow? (`nvidia-smi` reading guide)

Symptom	Likely cause
`GPU-Util 0%`, memory full	Process hung / waiting on data I/O
`GPU-Util` flickers 0% ↔ 100%	DataLoader is slow — increase `num_workers`, set `pin_memory=True`
`GPU-Util` sustained < 30%	Batch size too small / model is just small
OOM near memory cap	Reduce batch size / enable gradient checkpointing / switch to ZeRO-3

9. The "I can't remember what I did" lifesavers

history | tail -50                   # last 50 commands (find that training command)
history | grep python                # history filtered by python
!1234                                # rerun history entry 1234
!!                                   # rerun last command (e.g. sudo !!)
ctrl + r                             # reverse-search history (build muscle memory)
ctrl + c                             # interrupt current command
ctrl + z                             # suspend (stop) the job; fg resumes it, bg continues it in the background
which python                         # which python is on PATH (venv debug)
echo $PATH                           # where the shell looks for executables
env | grep CUDA                      # CUDA-related env vars

10. Long jobs detach from SSH (use Tmux)

GPU training runs for hours to days. If the SSH connection drops, a job started in that shell dies with it, so always start long jobs inside Tmux:

tmux new -s train                    # new session
# … run training inside …
# Press ctrl+b then d to detach (job keeps running)
tmux attach -t train                 # reattach
tmux ls                              # list sessions

What's next

Done with the cheat sheet → Tmux multi-session training
Edit pod files in a real IDE → VS Code Remote-SSH
Distributed training → Ray distributed PyTorch
