Single-node: tmpfs vs Ceph storage benchmark

Single-GPU and single-node multi-GPU LLaMA Factory fine-tune — /dev/shm vs bulk storage end-to-end timing comparison

Storage choice has a real impact on fine-tuning throughput, system performance, and data management. This page benchmarks the same fine-tuning workload on two storage backends — the in-memory tmpfs at /dev/shm versus bulk storage — and compares model load time, dataset load time, and end-to-end fine-tune time to give you a concrete data point for storage selection.

Prerequisites

An Alaya NeW enterprise account (sign up via account registration if you don't have one yet)

Single-GPU experiment

1. Preparation

Download the model, prepare LLaMA Factory, and prepare the dataset.

2. Pick SwanLab as the visualizer

Generate a SwanLab API key, then add the install + login commands to start.sh (highlight ① below).

start.sh layout

3. Path A: stage model + dataset to `/dev/shm`

Keep the highlighted ② block to copy assets into /dev/shm:

rclone copy /workspace/model/Qwen2.5-VL-7B-Instruct \
  /dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct --transfers 8 -P
rclone copy /workspace/LLaMA-Factory/data/images \
  /dev/shm/llamafactory/dataset/qa_images/images --transfers 8 -P
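
Because /dev/shm is memory-backed and limited in size, it can help to confirm that the assets fit before running the copy above. The following is a minimal sketch and not part of the original workflow: the helper names (dir_kib, avail_kib, fits) and the 10% headroom are assumptions chosen for illustration.

```shell
# Hypothetical pre-flight helpers (names and the 10% headroom are illustrative assumptions).

# dir_kib DIR: disk usage of DIR in KiB.
dir_kib() { du -sk "$1" | awk '{ print $1 }'; }

# avail_kib PATH: free space, in KiB, on the filesystem that holds PATH.
avail_kib() { df -Pk "$1" | awk 'NR == 2 { print $4 }'; }

# fits NEEDED AVAILABLE [HEADROOM_PERCENT]: succeed when NEEDED plus headroom
# is no larger than AVAILABLE. Runs in a subshell so its variables stay local.
fits() (
  needed=$1
  avail=$2
  headroom=${3:-10}
  [ $(( needed + needed * headroom / 100 )) -le "$avail" ]
)

# Example (commented out so that sourcing this file has no side effects):
# if fits "$(dir_kib /workspace/model/Qwen2.5-VL-7B-Instruct)" "$(avail_kib /dev/shm)"; then
#   echo "model fits in /dev/shm"
# fi
```

The headroom leaves space for the dataset and for files created during training. Adjust it to your workload.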

Run training and record model load time, dataset load time, end-to-end fine-tune time.
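
End-to-end time can be captured by wrapping whatever command launches the run. A minimal sketch, assuming training is started from a shell command: timed and fmt_hms are hypothetical helper names, and the launch command itself is left to the reader.

```shell
# fmt_hms SECONDS: format a duration as h:mm:ss, the format used in the results tables.
fmt_hms() {
  printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# timed CMD [ARGS...]: run CMD, print the elapsed wall-clock time to stderr,
# and return CMD's exit status. Runs in a subshell so its variables stay local.
timed() (
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "elapsed: $(fmt_hms $(( end - start )))" >&2
  exit "$status"
)

# Example (commented out; replace the placeholder with your own launch command):
# timed <your training command>
```

Model load and dataset load times still have to be read from the training log or the monitoring dashboard. This wrapper only measures the whole run.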

Train metrics (loss / learning_rate)

SwanLab Train — /dev/shm

System metrics (GPU / CPU)

SwanLab system metrics — /dev/shm

4. Tune dataloader_num_workers

Edit src/llamafactory/webui/runner.py in the LLaMA Factory repository. Between the highlighted lines, add dataloader_num_workers=12 and save. Then re-run training.

runner.py edit

System metrics after the change:

SwanLab system metrics — tuned

Conclusion 1: dataloader_num_workers = (CPU cores − 1)

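Set dataloader_num_workers to the number of allocated CPU cores minus one. For example, with 13C 200G 1GPU (13 CPU cores, 200 GB of memory, 1 GPU) allocated in Run Shell, set it to 12. With more workers preparing batches in parallel, the GPU spends less time waiting for data, which raises GPU utilization during training.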
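
The rule can be written as a one-line helper. This is a sketch: workers_for is a hypothetical name, and the core count should be the number allocated in Run Shell, which nproc may or may not report correctly inside a container.

```shell
# workers_for CORES: print CORES - 1, and 0 when only one core is available
# (0 makes the data loader run in the main process).
workers_for() {
  if [ "$1" -gt 1 ]; then
    echo $(( $1 - 1 ))
  else
    echo 0
  fi
}

# Example (commented out): derive the value from the visible core count.
# workers_for "$(nproc)"
```

For the 13-core allocation above, workers_for 13 prints 12.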

5. Path B: keep model + dataset on bulk storage

Remove the rclone copy lines and use /workspace/... paths directly. Re-run the same training and record the three timing metrics.
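
One way to switch between Path A and Path B without editing paths by hand is to derive them from a single backend name. This is a sketch only: model_dir and image_dir are hypothetical helpers, the paths are the ones used in the rclone commands on this page, and how the paths reach LLaMA Factory depends on your start.sh.

```shell
# model_dir BACKEND: print the model directory for BACKEND ("shm" or "bulk").
model_dir() {
  case "$1" in
    shm)  echo /dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct ;;
    bulk) echo /workspace/model/Qwen2.5-VL-7B-Instruct ;;
    *)    echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# image_dir BACKEND: print the image dataset directory for BACKEND.
image_dir() {
  case "$1" in
    shm)  echo /dev/shm/llamafactory/dataset/qa_images/images ;;
    bulk) echo /workspace/LLaMA-Factory/data/images ;;
    *)    echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Example (commented out): pick the backend once and reuse it.
# BACKEND=bulk
# echo "model: $(model_dir "$BACKEND")  images: $(image_dir "$BACKEND")"
```

Keeping the two sets of paths in one place makes it harder to accidentally benchmark a mixed setup, such as a model on /dev/shm with images on bulk storage.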

Train metrics — bulk storage

SwanLab Train — bulk

System metrics — bulk storage

SwanLab system metrics — bulk

After tuning dataloader_num_workers=12:

SwanLab system metrics — bulk tuned

6. Single-GPU results

Setting	`/dev/shm` (100 GB)	Bulk storage
Epochs	10	10
Dataset	2000+ multimodal images (1928×1208)	2000+ multimodal images (1928×1208)
Tuning	LoRA	LoRA
Batch size	2	2
Data load time per batch	1.42 s	1.64 s
Model load time (16 GB)	6.77 s	72.05 s
Fine-tune time (h:mm:ss)	4:10:22	4:27:23

Note: /dev/shm capacity in elastic container clusters

Each GPU in an elastic container cluster ships with 200 GB of memory. /dev/shm defaults to half of that — 100 GB. /dev/shm is a tmpfs mount with very high read/write speed.

Conclusion 2: storage selection (single-GPU)

When the dataset is < 100 GB, /dev/shm:

Improves single-batch data load by 13%
Improves model load by roughly 10×
Cuts end-to-end fine-tune time by ~17 minutes

For sub-100-GB workloads, prefer staging both dataset and model into /dev/shm.
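
The three figures in Conclusion 2 follow directly from the single-GPU results. A minimal sketch of the arithmetic, using hypothetical helper names (pct_faster, speedup, hms_to_s):

```shell
# pct_faster FAST SLOW: percentage reduction of FAST relative to SLOW.
pct_faster() {
  awk -v f="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (s - f) / s * 100 }'
}

# speedup FAST SLOW: how many times faster FAST is than SLOW.
speedup() {
  awk -v f="$1" -v s="$2" 'BEGIN { printf "%.1f\n", s / f }'
}

# hms_to_s H:MM:SS: convert a duration to seconds.
hms_to_s() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Example (commented out): values taken from the single-GPU results.
# pct_faster 1.42 1.64                                      # data load per batch
# speedup 6.77 72.05                                        # model load
# echo $(( $(hms_to_s 4:27:23) - $(hms_to_s 4:10:22) ))     # fine-tune time saved, in seconds
```

With those values the helpers give 13.4 (percent), 10.6 (times), and 1021 seconds, which is 17 minutes and 1 second.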

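Single-node multi-GPU experiment

The multi-GPU run uses WanDB for monitoring. Add the install and login commands to start.sh:

export WANDB_API_KEY=<your_api_key>
pip install wandb -i https://pypi.tuna.tsinghua.edu.cn/simple/
wandb login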

start.sh — multi-GPU

3. Launch training

In the final.sh editor, right-click and select Run Shell. In the parameter dialog:

Resources: 52C 800G 4GPU (52 CPU cores, 800 GB of memory, 4 GPUs)
Expand Advanced → Add External Access → port 7860

Run Shell config

Click Submit to start training and record the three timing metrics.

System metrics after tuning:

Tuned system metrics (panels 2a to 2d)

WanDB networking

WanDB uploads metrics to its hosted service during training, so the training container needs stable outbound network access. An unstable connection can delay or interrupt logging.
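
If outbound access is unreliable, one option is to fall back to WanDB's offline mode and upload the run afterwards. This is a sketch under stated assumptions: wandb_reachable and select_wandb_mode are hypothetical helpers, and the probe URL is WanDB's default cloud API host, which will differ for a self-hosted server.

```shell
# wandb_reachable [URL]: succeed if an HTTP response comes back within 5 seconds.
wandb_reachable() {
  curl -sS --max-time 5 -o /dev/null "${1:-https://api.wandb.ai}" 2>/dev/null
}

# select_wandb_mode CMD [ARGS...]: run CMD as a probe and print "online" if it
# succeeds, "offline" otherwise.
select_wandb_mode() {
  if "$@"; then
    echo online
  else
    echo offline
  fi
}

# Example (commented out so that sourcing this file makes no network call):
# export WANDB_MODE="$(select_wandb_mode wandb_reachable)"
```

Runs recorded with WANDB_MODE=offline are written to the local wandb directory and can be uploaded later with wandb sync.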

6. Path B: bulk storage

Remove rclone copy and train against bulk storage. System metrics (sample):

Bulk storage system metrics (panels 1a to 1d)

After dataloader_num_workers=52:

Bulk storage system metrics after tuning (panels 2a to 2d)

8. Single-node multi-GPU results

Setting	`/dev/shm` (100 GB)	Bulk storage
Epochs	10	10
Dataset	2000+ multimodal images (1928×1208)	2000+ multimodal images (1928×1208)
Tuning	LoRA	LoRA
Batch size	2	2
Data load time per batch (~6 MB)	5.75 s	6.5 s
Model load time (~16 GB)	9.53 s	1 min 41 s
Fine-tune time (h:mm:ss)	1:07:38	1:16:44

Conclusion 3: storage selection (single-node multi-GPU)

When the dataset is < 100 GB, /dev/shm:

Improves model load by roughly 10× (9.53 s versus 1 min 41 s)
Improves data load per batch by about 12% (5.75 s versus 6.5 s)
Cuts end-to-end fine-tune time by ~9 minutes

Overall, /dev/shm outperforms bulk storage on every measured metric.

Summary

Across single-GPU and single-node multi-GPU experiments:

For datasets and model files under 100 GB, staging into /dev/shm significantly improves training throughput and reduces end-to-end fine-tune time.
Setting dataloader_num_workers to (CPU cores − 1) further raises GPU utilization.

In practice, prefer /dev/shm for sub-100-GB workloads to get the best end-to-end performance.
