Alaya NeW Cloud

Single-node: tmpfs vs Ceph storage benchmark

Single-GPU and single-node multi-GPU LLaMA Factory fine-tune — /dev/shm vs bulk storage end-to-end timing comparison

Storage choice has a real impact on fine-tuning throughput, system performance, and data management. This page benchmarks the same fine-tuning workload on two storage backends — the in-memory tmpfs at /dev/shm versus bulk storage — and compares model load time, dataset load time, and end-to-end fine-tune time to give you a concrete data point for storage selection.

Prerequisites

Single-GPU experiment

1. Preparation

Download the model, prepare LLaMA Factory, and prepare the dataset.

2. Pick SwanLab as the visualizer

Generate a SwanLab API key, then add the install + login commands to start.sh (highlight ① below).

start.sh layout

3. Path A: stage model + dataset to /dev/shm

Keep the highlighted ② block to copy assets into /dev/shm:

rclone copy /workspace/model/Qwen2.5-VL-7B-Instruct \
  /dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct --transfers 8 -P
rclone copy /workspace/LLaMA-Factory/data/images \
  /dev/shm/llamafactory/dataset/qa_images/images --transfers 8 -P
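Because /dev/shm is backed by RAM, it can help to confirm that the assets fit before copying them. The following is a minimal sketch of such a pre-flight check; `fits_in` is a hypothetical helper introduced here for illustration and is not part of the original start.sh.

```shell
# Hypothetical pre-flight helper (an assumption, not part of start.sh).
# Succeeds only if the combined size of the source paths is smaller than
# the space currently available on the target mount.
fits_in() {
  local target="$1" need_kb free_kb p
  shift
  for p in "$@"; do
    [ -e "$p" ] || { echo "missing source: $p" >&2; return 2; }
  done
  # du -skc: per-path sizes in KiB plus a final "total" line.
  need_kb=$(du -skc "$@" | tail -n 1 | cut -f1)
  # df -Pk: POSIX layout with 1 KiB blocks; column 4 is the available space.
  free_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
  [ "$need_kb" -lt "$free_kb" ]
}

# Example wiring with the paths used on this page:
# fits_in /dev/shm /workspace/model/Qwen2.5-VL-7B-Instruct \
#   /workspace/LLaMA-Factory/data/images \
#   || { echo "not enough room in /dev/shm" >&2; exit 1; }
```

Placing the check before the rclone copy lines lets the job stop early instead of filling /dev/shm halfway through staging.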

Run training and record model load time, dataset load time, end-to-end fine-tune time.
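One way to capture the end-to-end figure is to wrap the training command in a small timer. This is only a sketch: `timed` and `fmt_hms` are hypothetical helpers, and the model and dataset load times would still have to be read from the training logs.

```shell
# Convert a whole number of seconds to H:MM:SS, the format used for the
# fine-tune times on this page.
fmt_hms() {
  printf '%d:%02d:%02d\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

# Run a command, report its wall-clock time on stderr, and pass its exit
# status through unchanged.
timed() {
  local label="$1" start rc=0
  shift
  start=$(date +%s)
  "$@" || rc=$?
  echo "$label: $(fmt_hms $(( $(date +%s) - start )))" >&2
  return "$rc"
}

# Example (replace "sleep 2" with the real training command):
# timed "fine-tune" sleep 2
```

Reporting on stderr keeps the timing line out of any output that is piped or captured from the training command itself.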

Train metrics (loss / learning_rate)

SwanLab Train — /dev/shm

System metrics (GPU / CPU)

SwanLab system metrics — /dev/shm

4. Tune dataloader_num_workers

Edit LLaMA Factory's /src/llamafactory/webui/runner.py. Between the highlighted lines, add dataloader_num_workers=12 and save. Re-run training.
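Before re-running, it can be worth confirming that the edit was saved. A minimal sketch follows; `has_setting` is a hypothetical helper, and the commented path assumes LLaMA Factory is checked out under /workspace/LLaMA-Factory.

```shell
# Hypothetical check: does FILE contain KEY=VALUE (spaces around "=" allowed)?
# The value must not be followed by another digit, so 1 does not match 12.
has_setting() {
  grep -Eq "$2[[:space:]]*=[[:space:]]*$3([^0-9]|\$)" "$1"
}

# Example (the checkout location is an assumption):
# has_setting /workspace/LLaMA-Factory/src/llamafactory/webui/runner.py \
#   dataloader_num_workers 12 && echo "edit is in place"
```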

runner.py edit

System metrics after the change:

SwanLab system metrics — tuned

Conclusion 1: dataloader_num_workers = (CPU cores − 1)

Set dataloader_num_workers to (CPU cores − 1). Example: with a 13C 200G 1GPU Run Shell allocation (13 CPU cores), set it to 12. The extra loader processes keep the GPU supplied with batches, which raises GPU utilization during training.
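The rule is simple enough to compute in a script. The sketch below uses a hypothetical helper, `workers_for_cores`, that applies the cores-minus-one rule and never goes below 0.

```shell
# dataloader_num_workers under this page's rule: allocated CPU cores minus
# one, never below 0 (0 means the main process loads the data itself).
workers_for_cores() {
  local cores="$1"
  if [ "$cores" -gt 1 ]; then
    echo $((cores - 1))
  else
    echo 0
  fi
}

workers_for_cores 13   # 13C allocation -> 12
workers_for_cores 52   # 52C allocation -> 51
```

On a live node, `workers_for_cores "$(nproc)"` is a reasonable starting point. In a container, `nproc` may not match the allocated core count, so passing the number from the Run Shell resource spec is the safer choice.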

5. Path B: keep model + dataset on bulk storage

Remove the rclone copy lines and use /workspace/... paths directly. Re-run the same training and record the three timing metrics.
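Rather than editing paths by hand between runs, the two layouts can be kept side by side behind a switch. This is a sketch under assumptions: `asset_paths` and the `shm`/`bulk` labels are hypothetical names, not LLaMA Factory options. The paths themselves are the ones used on this page.

```shell
# Print the model path (line 1) and dataset path (line 2) for a backend.
# The backend labels "shm" and "bulk" are hypothetical names for Path A and B.
asset_paths() {
  case "$1" in
    shm)
      echo "/dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct"
      echo "/dev/shm/llamafactory/dataset/qa_images/images"
      ;;
    bulk)
      echo "/workspace/model/Qwen2.5-VL-7B-Instruct"
      echo "/workspace/LLaMA-Factory/data/images"
      ;;
    *)
      echo "unknown backend: $1" >&2
      return 1
      ;;
  esac
}

asset_paths bulk
```

Keeping both layouts in one place makes it less likely that the two runs differ in anything other than the storage backend.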

Train metrics — bulk storage

SwanLab Train — bulk

System metrics — bulk storage

SwanLab system metrics — bulk

After tuning dataloader_num_workers=12:

SwanLab system metrics — bulk tuned

6. Single-GPU results

| Setting | /dev/shm (100 GB) | Bulk storage |
| --- | --- | --- |
| Epoch | 10 | 10 |
| Dataset | 2000+ multimodal images (1928×1208) | 2000+ multimodal images (1928×1208) |
| Tuning | LoRA | LoRA |
| Batch size | 2 | 2 |
| Data load time per batch | 1.42 s | 1.64 s |
| Model load time (16 GB) | 6.77 s | 72.05 s |
| Fine-tune time | 4:10:22 | 4:27:23 |

Note: /dev/shm capacity in elastic container clusters

Each GPU in an elastic container cluster ships with 200 GB of memory. /dev/shm defaults to half of that — 100 GB. /dev/shm is a tmpfs mount with very high read/write speed.
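The capacity can be checked on a running node before staging anything. A minimal sketch, with `expected_shm_gb` and `mount_size_gib` as hypothetical helper names:

```shell
# Expected /dev/shm size when it defaults to half of the allocated memory.
expected_shm_gb() {
  echo $(($1 / 2))
}

# Actual size of a mount in GiB, rounded to the nearest integer.
# df -Pk: POSIX layout with 1 KiB blocks; column 2 is the total size.
mount_size_gib() {
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $2 / 1024 / 1024 }'
}

expected_shm_gb 200          # 200 GB allocation -> 100
# mount_size_gib /dev/shm    # compare with the expected value on a real node
```

The reported number can differ slightly from the nominal value because `df` counts in binary units.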

Conclusion 2: storage selection (single-GPU)

When the dataset is < 100 GB, /dev/shm:

  • Improves single-batch data load by 13%
  • Improves model load by roughly 10×
  • Cuts end-to-end fine-tune time by ~17 minutes

For sub-100-GB workloads, prefer staging both dataset and model into /dev/shm.
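The figures in this conclusion follow directly from the results table. The sketch below reproduces the arithmetic with three hypothetical helpers; the inputs are the measured values reported above.

```shell
# Ratio of the slower time to the faster one ("10.6" means 10.6x faster).
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Percentage of time saved relative to the slower measurement.
pct_saved() {
  awk -v slow="$1" -v fast="$2" \
    'BEGIN { printf "%.1f\n", (slow - fast) / slow * 100 }'
}

# H:MM:SS -> seconds.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

speedup 72.05 6.77       # model load: bulk vs /dev/shm
pct_saved 1.64 1.42      # data load per batch
echo $(( $(to_seconds 4:27:23) - $(to_seconds 4:10:22) ))   # seconds saved
```

This prints 10.6, 13.4, and 1021. The last value is 17 minutes 1 second, which matches the ~17 minutes quoted above.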

Single-node multi-GPU experiment

1. Preparation

Following "single-node multi-GPU example", download the model, prepare LLaMA Factory, and prepare the dataset.

2. Path A: stage to /dev/shm

In start.sh, configure W&B (highlight ①) and keep the rclone copy block (highlight ②) to stage assets to /dev/shm:

export WANDB_API_KEY=<your_api_key>
pip install wandb -i https://pypi.tuna.tsinghua.edu.cn/simple/
wandb login
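If the placeholder key is left in place, the login step only fails once the job is already running. A small guard can catch that earlier. This is a sketch; `require_env` is a hypothetical helper, not part of wandb.

```shell
# Hypothetical guard: fail early if a variable is unset, empty, or still an
# angle-bracket placeholder such as "<your_api_key>".
require_env() {
  local name="$1" value
  eval "value=\${$name:-}"
  case "$value" in
    ''|'<'*'>')
      echo "$name is not set" >&2
      return 1
      ;;
  esac
}

# Example: run the login step only when the key looks real.
# require_env WANDB_API_KEY && wandb login
```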

start.sh — multi-GPU

3. Launch training

In the final.sh editor, right-click and select Run Shell. In the parameter dialog:

  • Resources: 52C 800G 4GPU
  • Expand Advanced → Add External Access → port 7860

Run Shell config

Click Submit to start training and record the three timing metrics.

4. W&B monitoring (/dev/shm)

Train tab

W&B Train (/dev/shm)

System tab (multi-GPU view)

Monitor 1a Monitor 1b Monitor 1c Monitor 1d

5. Tune dataloader_num_workers

In /src/llamafactory/webui/runner.py, set dataloader_num_workers=51 (CPU cores − 1 for 4-GPU 52C).

runner.py edit

System metrics after tuning:

Tuned 2a Tuned 2b Tuned 2c Tuned 2d

W&B networking

W&B uploads metrics to its hosted service while training runs, so make sure the node has stable outbound network access. Otherwise logging can stall or drop data.
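If outbound access is unreliable, one option is to fall back to W&B's offline mode and upload the run afterwards. The sketch below is an assumption about how this could be wired into start.sh: `pick_wandb_mode` is a hypothetical helper, while `WANDB_MODE` and `wandb sync` are standard W&B features.

```shell
# Print "online" if the probe command succeeds, otherwise "offline".
# The probe is whatever command the caller passes in.
pick_wandb_mode() {
  if "$@" >/dev/null 2>&1; then
    echo online
  else
    echo offline
  fi
}

# Example: probe the W&B API endpoint with a 5 second limit.
# export WANDB_MODE="$(pick_wandb_mode curl -fsS --max-time 5 \
#   -o /dev/null https://api.wandb.ai)"
# Offline runs can be uploaded later with: wandb sync <run directory>
```

Passing the probe as an argument keeps the helper independent of any particular connectivity test.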

6. Path B: bulk storage

Remove rclone copy and train against bulk storage. System metrics (sample):

Bulk monitor 1a Bulk monitor 1b Bulk monitor 1c Bulk monitor 1d

After dataloader_num_workers=51:

Bulk tuned 2a Bulk tuned 2b Bulk tuned 2c Bulk tuned 2d

7. Side-by-side comparison

W&B supports cross-project metric comparisons: green = /dev/shm, red = bulk storage.

Train tab

Train comparison

System resources

System compare a System compare b System compare c System compare d

8. Single-node multi-GPU results

| Setting | /dev/shm (100 GB) | Bulk storage |
| --- | --- | --- |
| Epoch | 10 | 10 |
| Dataset | 2000+ multimodal images (1928×1208) | 2000+ multimodal images (1928×1208) |
| Tuning | LoRA | LoRA |
| Batch size | 2 | 2 |
| Data load time per batch (~6 MB) | 5.75 s/batch | 6.5 s/batch |
| Model load time (~16 GB) | 9.53 s | 1 min 41 s |
| Fine-tune time | 1:07:38 | 1:16:44 |

Conclusion 3: storage selection (single-node multi-GPU)

When the dataset is < 100 GB:

  • Model load on /dev/shm is roughly 10× faster than bulk storage (9.53 s versus 1 min 41 s)
  • Dataset load per batch is about 12% faster (5.75 s versus 6.5 s)
  • Fine-tune time is shorter by ~9 minutes

Overall, /dev/shm outperforms bulk storage, with the largest gain in model load time.

Summary

Across single-GPU and single-node multi-GPU experiments:

  • For datasets and model files under 100 GB, staging into /dev/shm speeds up model and data loading and cut end-to-end fine-tune time by about 17 minutes on one GPU and about 9 minutes on four GPUs.
  • Setting dataloader_num_workers to (CPU cores − 1) further raises GPU utilization.

In practice, prefer /dev/shm for sub-100-GB workloads to get the best end-to-end performance.
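The recommendation reduces to a single threshold check. A minimal sketch follows; `choose_backend` is a hypothetical helper, and the 100 GB default is the /dev/shm capacity reported on this page.

```shell
# Pick a backend from the total asset size in whole GB.
# The limit defaults to 100 GB, the /dev/shm capacity reported on this page.
choose_backend() {
  local size_gb="$1" limit_gb="${2:-100}"
  if [ "$size_gb" -lt "$limit_gb" ]; then
    echo shm
  else
    echo bulk
  fi
}

choose_backend 16     # ~16 GB model -> shm
choose_backend 120    # larger than /dev/shm -> bulk
```

The second argument overrides the limit on nodes whose /dev/shm has a different size.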
