Using PyTorch for fine-tuning in Virtual Kubernetes Services

Updated at: 2025-12-02 15:30:25

PyTorch is an open-source machine learning library widely used in both academia and industry, particularly in fields such as natural language processing (NLP), computer vision (CV), and reinforcement learning. PyTorch is often used together with Jupyter Notebook.

In this introductory example, we will walk you through using PyTorch to fine-tune a model in Virtual Kubernetes Services (VKS).

Prerequisites

This tutorial assumes that you have:

  • Installed the kubectl program on your local machine.
  • Created a VKS cluster. For detailed steps, see: Create VKS.
Note

Make sure that the VKS cluster and the image repository you intend to use are created in the same AI Data Center (AIDC).

Tutorial Source Code

First, download the source files required for this tutorial.

File list

The files used in this tutorial and their purposes are described below.

| File Name | Description |
| --- | --- |
| Dockerfile | Image build file, used to build the Docker image |
| deployment-1node-1gpu.yaml | Deployment resource defining how to start and stop the Pod for the single-node, single-GPU setup |
| deployment-1node-2gpu.yaml | Deployment resource defining how to start and stop the Pod for the single-node, multi-GPU setup |
| deployment-2node-2gpu.yaml | Deployment resource defining how to start and stop the Pods for the multi-node, multi-GPU setup |
| llama_sft | Folder containing the fine-tuning example code |

The "llama_sft" fine-tuning folder contains the following files. The files along with their purposes are described below.

| File Name | Description |
| --- | --- |
| ds_config.json | DeepSpeed configuration file |
| sft_data.json | Fine-tuning dataset |
| llama_sft.py | Fine-tuning Python script |
| llama_sft_1node_1gpu.sh | Single-node, single-GPU fine-tuning script |
| llama_sft_1node_2gpu.sh | Single-node, multi-GPU fine-tuning script |
| llama_sft_2node_2gpu_ds.sh | Multi-node, multi-GPU fine-tuning script (runs on the primary node) |
| llama_sft_2node_2gpu_ds2.sh | Multi-node, multi-GPU fine-tuning script (runs on the secondary node) |

Dockerfile

Based on the PyTorch base image, the Dockerfile builds a custom Docker image by installing additional Python packages (e.g., transformers, torch, peft, jupyterlab) and setting the working directory to /workspace, among other configuration.
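
The full Dockerfile is included in the tutorial source files; the minimal sketch below shows roughly what it looks like (the package list and version pins are illustrative assumptions inferred from the custom image tag, not the confirmed file contents):

# Sketch of the tutorial's Dockerfile; the actual file may differ in details.
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel

# Install the additional Python packages used by the fine-tuning code
# (version pins here are assumptions based on the custom image tag)
RUN pip install --no-cache-dir transformers==4.41.2 peft deepspeed jupyterlab

# The tutorial mounts and runs everything under /workspace
WORKDIR /workspace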

Deployment

In this example, the three files "deployment-1node-1gpu.yaml", "deployment-1node-2gpu.yaml", and "deployment-2node-2gpu.yaml" define how to start and stop Pods for the single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU scenarios, respectively. The single-node single-GPU manifest is shown below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deploy-1node-1gpu
  namespace: llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      restartPolicy: Always
      containers:
        - name: coding-dev-container
          image: registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel
          resources:
            requests:
              memory: "200Gi"
              cpu: "64"
              nvidia.com/gpu-h800: 1
              rdma/rdma_shared_device_a: 1
              rdma/rdma_shared_device_b: 1
            limits:
              memory: "200Gi"
              cpu: "64"
              nvidia.com/gpu-h800: 1
              rdma/rdma_shared_device_a: 1
              rdma/rdma_shared_device_b: 1
          command: ["sh", "-c", "tail -f /dev/null"]
          volumeMounts:
            - name: workspace
              mountPath: "/workspace"
              subPath: "pytorch/workspace"
          env:
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_IB_HCA
              value: "ib7s"
      imagePullSecrets:
        - name: harbor-secret
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: pvc-capacity-userdata

This manifest instructs the VKS Kubernetes control plane to do the following:

  1. Ensure that only one Pod runs at any given time, as defined by the "spec.replicas" field in the manifest.
  2. Reserve GPU, CPU, and memory resources on the compute node where the Pod runs. Each Pod is allocated one GPU, as specified by the "nvidia.com/gpu-h800" entry under "spec.template.spec.containers[].resources.limits".
  3. Specify the container image, as defined by the "spec.template.spec.containers[].image" field.
  4. Specify the mount directory of the PVC, as defined by "spec.template.spec.containers[].volumeMounts".
  5. Specify the PVC itself, as defined under "spec.template.spec.volumes".
Note

When preparing the deployment YAML files, replace the following information with your own:

| Variable Name | Description | Source | Example |
| --- | --- | --- | --- |
| image | Image name | Custom image | registry.hd-01.alayanew.com:8443/[user]/pytorch:2.3.1-cuda12.1-cudnn8-devel |
| resources.requests.[GPU] | GPU resource info | VKS | nvidia.com/gpu-h800 |

Procedure

Prepare the image

Note

Make sure that the image repository and the VKS cluster are created in the same AIDC.

In the commands below, replace the Dockerfile path, Harbor account and password, image name, and image repository address with your own values:

  • USERNAME/PASSWORD: see the SMS notification you received when the image repository was created.
  • IMAGE_REGISTRY_DOMAIN: see Using the Image Repository.
  • IMAGE_REGISTRY_URL: of the form "IMAGE_REGISTRY_DOMAIN/project".

# pull image
docker pull pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel

# build image
docker build -t pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel -f /path/to/Dockerfile .

# login
docker login IMAGE_REGISTRY_DOMAIN -u USERNAME -p PASSWORD

# tag
docker tag \
pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel \
IMAGE_REGISTRY_URL/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel

# push
docker push IMAGE_REGISTRY_URL/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel
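
You can list the local images to confirm the tag was applied (an optional check, not part of the original steps):

docker images | grep transformers4.41.2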

Create basic Kubernetes resources

Note

In the commands below, replace the registry address, username, password, and email with your own.

# Specify VKS configuration
export KUBECONFIG="[/path/to/kubeconfig]"

# Create namespace
kubectl create namespace llama

# Create secret
kubectl create secret docker-registry harbor-secret \
--docker-server=registry.hd-01.alayanew.com:8443 \
--docker-username="user" \
--docker-password="password" \
--docker-email="email" \
--namespace llama
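
To verify that the namespace and pull secret were created (an optional check, not part of the original steps):

kubectl get namespace llama
kubectl get secret harbor-secret -n llama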

Single-Node, Single-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-1node-1gpu.yaml
kubectl get all -n llama

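Optionally, confirm that the Pod was scheduled with the requested GPU, CPU, and memory resources (not part of the original steps; the label selector matches the Deployment's "app: llama" label):

# Show the scheduled Pod's resource requests and limits
kubectl -n llama describe pod -l app=llama | grep -A 6 "Requests:"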

Prepare the scripts

On the host machine, run the following command to copy the fine-tuning scripts to the persistent directory. In this tutorial, the script directory is "/workspace/llama_sft".

kubectl cp [/path/to/llama_sft] llama/llama-deploy-1node-1gpu-6d77656b9f-bxfbc:/workspace/llama_sft
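
To confirm the copy succeeded, you can list the directory inside the Pod (an optional check; replace the Pod name with your own):

kubectl exec -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- ls /workspace/llama_sft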

Enter the working directory in the Pod

Note

Replace the Pod name with the actual name of the Pod you created.

kubectl exec -it -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- bash

cd llama_sft

ls -l


Download the model

Download the model to the persistent directory so that it does not need to be downloaded again later. In this tutorial, the model directory is "/workspace/Meta-Llama-3-8B-Instruct".

pip install modelscope
modelscope download --model LLM-Research/Meta-Llama-3-8B-Instruct --local_dir /workspace/Meta-Llama-3-8B-Instruct
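
To confirm the download completed, list the model directory (an optional check):

ls -lh /workspace/Meta-Llama-3-8B-Instruct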

Run the single-node single-GPU fine-tuning script

bash llama_sft_1node_1gpu.sh

[Screenshot: training starts]

[Screenshot: training in progress]

Monitor GPU utilization

In a separate terminal, enter the Pod and run the following commands:

kubectl exec -it -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- bash
watch -n 1 nvidia-smi


Single-Node, Multi-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-1node-2gpu.yaml
kubectl get all -n llama

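Optionally, confirm that the Pod sees two GPUs before launching (not part of the original steps; replace the Pod name with your own):

# List the GPUs visible inside the Pod
kubectl exec -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- nvidia-smi -L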

Enter the working directory in the Pod

Note

Replace the Pod name with the actual name of the Pod you created.

# Pod-1
kubectl exec -it -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- bash

cd llama_sft

ls -l


Run the single-node multi-GPU fine-tuning script

bash llama_sft_1node_2gpu.sh

[Screenshot: training starts]

[Screenshot: training in progress]

Monitor GPU utilization

In another terminal, enter the Pod and run the following commands:

kubectl exec -it -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- bash
watch -n 1 nvidia-smi


Multi-Node, Multi-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-2node-2gpu.yaml
kubectl get all -n llama -o wide


On the primary node, enter the working directory

Select one Pod as your primary node and enter the working directory to perform the configuration. In this example, the Pod with IP "172.29.203.147" is used as the primary node.
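
If you only need each Pod's IP, a custom-columns query is a convenient alternative to "-o wide" (an optional shortcut, not part of the original steps):

# Print each Pod's name and IP in the llama namespace
kubectl get pods -n llama -o custom-columns=NAME:.metadata.name,IP:.status.podIP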

Note

Replace the Pod name with the actual name of the Pod you started.

# Pod-1
kubectl exec -it -n llama pod/llama-deploy-2node-2gpu-6d75f5457c-8z8mq -- bash

cd llama_sft

ls -l


Modify the multi-node multi-GPU fine-tuning scripts

Modify the "master_addr" parameter in the "llama_sft_2node_2gpu_ds.sh" and "llama_sft_2node_2gpu_ds2.sh" scripts so that it points to the primary Pod's IP address, as shown below:

File "llama_sft_2node_2gpu_ds.sh"

image-20241220180110966

File "llama_sft_2node_2gpu_ds2.sh"

image-20241220180431568
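
The tutorial's screenshots show the exact lines to edit; as a rough, hedged sketch of what such a line typically looks like (assuming the scripts use a standard torchrun launch, which the tutorial does not confirm):

# Illustrative sketch only; the real scripts are in the tutorial source.
# master_addr must be the IP of the primary Pod (172.29.203.147 in this example).
# node_rank is 0 on the primary node (llama_sft_2node_2gpu_ds.sh) and 1 on the
# secondary node (llama_sft_2node_2gpu_ds2.sh).
# The --deepspeed flag assumes llama_sft.py parses Hugging Face TrainingArguments.
master_addr=172.29.203.147

torchrun \
  --nnodes=2 \
  --nproc_per_node=2 \
  --node_rank=0 \
  --master_addr=${master_addr} \
  --master_port=29500 \
  llama_sft.py --deepspeed ds_config.json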

Run the multi-node multi-GPU fine-tuning script on the primary node

bash llama_sft_2node_2gpu_ds.sh


Run the multi-node multi-GPU fine-tuning script on the secondary node

# Open a new terminal
kubectl exec -it -n llama pod/llama-deploy-85678bfb74-sbdxc -- bash
cd llama_sft
bash llama_sft_2node_2gpu_ds2.sh


[Screenshot: training in progress]

Monitor GPU utilization

In two additional terminals, enter the primary and secondary Pods respectively and run the following commands to monitor GPU utilization.

Primary node:

kubectl exec -it -n llama pod/llama-deploy-2node-2gpu-6d75f5457c-8z8mq -- bash
watch -n 1 nvidia-smi


Secondary node:

kubectl exec -it -n llama pod/llama-deploy-85678bfb74-sbdxc -- bash
watch -n 1 nvidia-smi
