Alaya NeW Cloud

Multi-node DeepSpeed: tmpfs vs Ceph storage benchmark

Multi-node multi-GPU DeepSpeed fine-tune — /dev/shm vs bulk storage end-to-end timing comparison

Extending the methodology from the single-node benchmark, this page applies the same comparison to a multi-node, multi-GPU DeepSpeed workload to check whether the choice of storage medium has the same impact at distributed scale.

Prerequisites

Experiment

1. Preparation

Following the "multi-node multi-GPU (DeepSpeed)" preparation steps, download the model, prepare LLaMA Factory, and prepare the dataset.

2. Pick TensorBoard as the visualizer

Follow the "multi-node multi-GPU (DeepSpeed) preparation" guidance to get TensorBoard ready.

3. Path A: stage to /dev/shm

In mmmc_DS.sh, keep the rclone copy block to stage the model and dataset into /dev/shm:

rclone copy /workspace/model/Qwen2.5-VL-7B-Instruct \
  /dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct --transfers 8 -P
rclone copy /workspace/LLaMA-Factory/data/images \
  /dev/shm/llamafactory/dataset/qa_images/images --transfers 8 -P

Run training and record three metrics: model load time, dataset load time, and end-to-end fine-tune time.
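For the wall-clock parts of the measurement (staging and the end-to-end run), a small wrapper can make the numbers reproducible. This is a minimal sketch: `timed` is a hypothetical helper, not part of mmmc_DS.sh, and it cannot see model or dataset load time inside the training process, which presumably have to come from the training logs.

```bash
# Hypothetical helper: run a command, print its wall-clock time to stderr,
# and pass the command's exit status through unchanged.
timed() {
  local label=$1; shift
  local start end status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$label: $(( end - start )) s" >&2
  return "$status"
}

# Example (paths taken from the rclone block above):
# timed "stage model" rclone copy /workspace/model/Qwen2.5-VL-7B-Instruct \
#   /dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct --transfers 8
```

Whether staging time should be counted toward the Path A fine-tune time is a judgment call; timing it separately keeps both options open.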

TensorBoard (loss / learning_rate)

Figure: TensorBoard, /dev/shm run

VS Code Monitor (per-worker system metrics)

In VS Code, right-click the Running task → Monitor to inspect each worker's CPU usage, memory usage, GPU utilization, and GPU memory clock.

Figures: Worker 1, Worker 2, and Worker 3 monitors, /dev/shm run

4. Path B: bulk storage

Remove the rclone copy block and let the model and dataset live on bulk storage. Re-run the same training and record the three metrics.
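Instead of deleting the block by hand between runs, the two paths could be selected with a switch so that both runs use an otherwise identical script. The sketch below is an assumption about how that might look: `STAGE_TO_SHM`, `model_dir`, and `stage_model` are hypothetical names that do not exist in mmmc_DS.sh, and only the model is shown (the dataset would follow the same pattern).

```bash
# Hypothetical switch (not part of the original mmmc_DS.sh):
#   STAGE_TO_SHM=1  -> Path A, copy inputs into /dev/shm first
#   STAGE_TO_SHM=0  -> Path B, read directly from bulk storage
STAGE_TO_SHM="${STAGE_TO_SHM:-1}"

BULK_MODEL=/workspace/model/Qwen2.5-VL-7B-Instruct
SHM_MODEL=/dev/shm/llamafactory/model/Qwen2.5-VL-7B-Instruct

# Print the model directory that matches the selected path.
model_dir() {
  if [ "$STAGE_TO_SHM" = "1" ]; then
    echo "$SHM_MODEL"
  else
    echo "$BULK_MODEL"
  fi
}

# Run the staging copy only for Path A; does nothing for Path B.
stage_model() {
  if [ "$STAGE_TO_SHM" = "1" ]; then
    rclone copy "$BULK_MODEL" "$SHM_MODEL" --transfers 8 -P
  fi
}
```

With this in place, the script would call `stage_model` before launching training and point the training configuration at the directory printed by `model_dir`.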

TensorBoard — bulk storage

Figure: TensorBoard, bulk storage run

VS Code Monitor — bulk storage

Figures: Worker 1 and Worker 2 monitors, bulk storage run

Results

| Setting | /dev/shm (100 GB) | Bulk storage |
|---|---|---|
| Epochs | 10 | 10 |
| Dataset | 2000+ multimodal images (1928×1208) | 2000+ multimodal images (1928×1208) |
| Tuning method | LoRA | LoRA |
| Batch size | 2 | 2 |
| Data load time / batch (~6 MB) | 6.5 s | 1 min 33 s |
| Model load time (16 GB) | 21 s/batch | 22 s/batch |
| Fine-tune time | 0:43:33 | 1:03:52 |
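The relative gains can be derived directly from the timings in the table. The following is a small sketch of that arithmetic; `to_seconds` and `compare` are hypothetical helper names introduced only for this illustration.

```bash
# Convert an H:MM:SS duration to seconds.
to_seconds() { echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'; }

# Given a fast and a slow duration in seconds, print the share of time
# saved and the equivalent speedup factor.
compare() {
  awk -v fast="$1" -v slow="$2" 'BEGIN {
    printf "%.1f%% less time, %.2fx speedup\n", (slow - fast) / slow * 100, slow / fast
  }'
}

# Fine-tune time: 0:43:33 on /dev/shm vs 1:03:52 on bulk storage
compare "$(to_seconds 0:43:33)" "$(to_seconds 1:03:52)"
# prints: 31.8% less time, 1.47x speedup

# Data load per batch: 6.5 s vs 1 min 33 s (93 s)
compare 6.5 93
# prints: 93.0% less time, 14.31x speedup
```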

Note: /dev/shm capacity in elastic container clusters

Each GPU in an elastic container cluster comes with 200 GB of memory, and /dev/shm defaults to half of that: 100 GB. Because /dev/shm is a tmpfs mount backed by RAM, its read/write speed is very high.
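Given the fixed capacity, it may be worth checking that the model and dataset actually fit before staging them. The sketch below compares the total size of the source directories with the free space on the target mount; `fits_in` is a hypothetical helper, and the 10% headroom is an arbitrary assumption, chosen because tmpfs pages occupy RAM that training also needs.

```bash
# Hypothetical helper: succeed only if the given source directories fit
# into the free space of the filesystem that holds the target path.
# Usage: fits_in TARGET SRC_DIR [SRC_DIR...]
fits_in() {
  local target=$1; shift
  local need_kb avail_kb
  # Grand total of all sources in KiB (the last line of `du -c` is the total).
  need_kb=$(du -sck "$@" | tail -n 1 | cut -f1)
  # Free space on the target filesystem in KiB (POSIX output format).
  avail_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
  # Assumed safety margin: payload plus 10% must fit.
  [ $(( need_kb + need_kb / 10 )) -le "$avail_kb" ]
}

# Example (paths taken from the rclone block in step 3):
# fits_in /dev/shm /workspace/model/Qwen2.5-VL-7B-Instruct \
#   /workspace/LLaMA-Factory/data/images || echo "not enough room in /dev/shm" >&2
```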

Conclusion

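For datasets smaller than 100 GB (the /dev/shm capacity):

  • Per-batch data load is roughly 14× faster (6.5 s vs 1 min 33 s), while model load time is nearly unchanged (21 s vs 22 s)
  • End-to-end fine-tune time drops by about 32% (0:43:33 vs 1:03:52)
  • Data handling is steadier and faster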

Under multi-node multi-GPU DeepSpeed, the recommendation holds: stage dataset and model into /dev/shm. The conclusion matches the single-node case.

Cross-experiment summary

Putting the three experiments side by side, the impact of storage choice trends consistently:

| Scenario | /dev/shm model load | Bulk model load | Speedup |
|---|---|---|---|
| Single-GPU | 6.77 s | 72.05 s | ~10× |
| Single-node multi-GPU | 5.75 s | 6.5 s | ~1.13× |
| Multi-node DeepSpeed | 21 s/batch | 22 s/batch | ~1.05× |

The model-load gap narrows at multi-GPU scale because weights are sharded across cards, so each card reads less. Dataset load time and end-to-end fine-tune time still show a substantial gap.

Next steps
