Using PyTorch for fine-tuning in Virtual Kubernetes Services

Updated at: 2025-12-02 15:30:25

PyTorch is an open-source machine learning library widely used in both academia and industry, particularly in fields such as natural language processing (NLP), computer vision (CV), and reinforcement learning. PyTorch is often used together with Jupyter Notebook.

In this introductory example, we will walk you through using PyTorch to fine-tune a model in Virtual Kubernetes Services (VKS).

Prerequisites

This tutorial assumes that you have:

  • Installed the kubectl program on your local machine.
  • Created a VKS cluster. For detailed steps, see: Create VKS.
Note

Make sure that the VKS cluster and the image repository you intend to use are created in the same AI Data Center (AIDC).

Tutorial Source Code

First, download the source files required for this tutorial.

File list

The files used in this tutorial and their purposes are described below.

| File Name | Description |
| --- | --- |
| Dockerfile | Image build file, used to build the Docker image |
| deployment-1node-1gpu.yaml | Deployment resource defining how to start and stop the Pod for the single-node, single-GPU setup |
| deployment-1node-2gpu.yaml | Deployment resource defining how to start and stop the Pod for the single-node, multi-GPU setup |
| deployment-2node-2gpu.yaml | Deployment resource defining how to start and stop the Pods for the multi-node, multi-GPU setup |
| llama_sft | Folder containing the fine-tuning example code |

The "llama_sft" fine-tuning folder contains the following files. The files along with their purposes are described below.

| File Name | Description |
| --- | --- |
| ds_config.json | DeepSpeed configuration file |
| sft_data.json | Fine-tuning dataset |
| llama_sft.py | Fine-tuning Python script |
| llama_sft_1node_1gpu.sh | Single-node, single-GPU fine-tuning script |
| llama_sft_1node_2gpu.sh | Single-node, multi-GPU fine-tuning script |
| llama_sft_2node_2gpu_ds.sh | Multi-node, multi-GPU fine-tuning script (runs on the primary node) |
| llama_sft_2node_2gpu_ds2.sh | Multi-node, multi-GPU fine-tuning script (runs on the secondary node) |

Dockerfile

Based on the PyTorch base image, the Dockerfile builds a custom Docker image by installing additional Python packages (e.g., transformers, torch, peft, jupyterlab) and setting the working directory to /workspace, among other configuration.
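
The full Dockerfile is included in the tutorial source files; the minimal sketch below shows roughly what it looks like (the package list and version pins are illustrative assumptions inferred from the custom image tag, not the confirmed file contents):

# Sketch of the tutorial's Dockerfile; the actual file may differ in details.
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel

# Install the additional Python packages used by the fine-tuning code
# (version pins here are assumptions based on the custom image tag)
RUN pip install --no-cache-dir transformers==4.41.2 peft deepspeed jupyterlab

# The tutorial mounts and runs everything under /workspace
WORKDIR /workspace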

Deployment

In this example, the three files "deployment-1node-1gpu.yaml", "deployment-1node-2gpu.yaml", and "deployment-2node-2gpu.yaml" define how to start and stop Pods for the single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU scenarios, respectively. The single-node single-GPU manifest is shown below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deploy-1node-1gpu
  namespace: llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      restartPolicy: Always
      containers:
        - name: coding-dev-container
          image: registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel
          resources:
            requests:
              memory: "200Gi"
              cpu: "64"
              nvidia.com/gpu-h800: 1
              rdma/rdma_shared_device_a: 1
              rdma/rdma_shared_device_b: 1
            limits:
              memory: "200Gi"
              cpu: "64"
              nvidia.com/gpu-h800: 1
              rdma/rdma_shared_device_a: 1
              rdma/rdma_shared_device_b: 1
          command: ["sh", "-c", "tail -f /dev/null"]
          volumeMounts:
            - name: workspace
              mountPath: "/workspace"
              subPath: "pytorch/workspace"
          env:
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_IB_HCA
              value: "ib7s"
      imagePullSecrets:
        - name: harbor-secret
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: pvc-capacity-userdata

This manifest instructs the VKS Kubernetes control plane to do the following:

  1. Ensure that only one Pod runs at any given time, as defined by the "spec.replicas" field in the manifest.
  2. Reserve GPU, CPU, and memory resources on the compute node where the Pod runs. Each Pod is allocated one GPU, as specified by the "nvidia.com/gpu-h800" entry under "spec.template.spec.containers[].resources.limits".
  3. Specify the container image, as defined by the "spec.template.spec.containers[].image" field.
  4. Specify the mount directory of the PVC, as defined by "spec.template.spec.containers[].volumeMounts".
  5. Specify the PVC itself, as defined under "spec.template.spec.volumes".
Note

When preparing the deployment YAML files, replace the following information with your own:

| Variable Name | Description | Source | Example |
| --- | --- | --- | --- |
| image | Image name | Custom image | registry.hd-01.alayanew.com:8443/[user]/pytorch:2.3.1-cuda12.1-cudnn8-devel |
| resources.requests.[GPU] | GPU resource info | VKS | nvidia.com/gpu-h800 |

Procedure

Prepare the image

Note

Make sure that the image repository and the VKS cluster are created in the same AIDC.

In the commands below, replace the Dockerfile path, Harbor account and password, image name, and image repository address with your own values:

  • USERNAME/PASSWORD: see the SMS notification you received when the image repository was created.
  • IMAGE_REGISTRY_DOMAIN: see Using the Image Repository.
  • IMAGE_REGISTRY_URL: of the form "IMAGE_REGISTRY_DOMAIN/project".

# pull image
docker pull pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel

# build image
docker build -t pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel -f /path/to/Dockerfile .

# login
docker login IMAGE_REGISTRY_DOMAIN -u USERNAME -p PASSWORD

# tag
docker tag \
pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel \
IMAGE_REGISTRY_URL/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel

# push
docker push IMAGE_REGISTRY_URL/pytorch:2.3.1-cuda12.1-cudnn8-pyton310-transformers4.41.2-devel
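
You can list the local images to confirm the tag was applied (an optional check, not part of the original steps):

docker images | grep transformers4.41.2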

Create basic Kubernetes resources

Note

In the commands below, replace the registry address, username, password, and email with your own.

# Specify VKS configuration
export KUBECONFIG="[/path/to/kubeconfig]"

# Create namespace
kubectl create namespace llama

# Create secret
kubectl create secret docker-registry harbor-secret \
--docker-server=registry.hd-01.alayanew.com:8443 \
--docker-username="user" \
--docker-password="password" \
--docker-email="email" \
--namespace llama
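
To verify that the namespace and pull secret were created (an optional check, not part of the original steps):

kubectl get namespace llama
kubectl get secret harbor-secret -n llama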

Single-Node, Single-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-1node-1gpu.yaml
kubectl get all -n llama

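Optionally, confirm that the Pod was scheduled with the requested GPU, CPU, and memory resources (not part of the original steps; the label selector matches the Deployment's "app: llama" label):

# Show the scheduled Pod's resource requests and limits
kubectl -n llama describe pod -l app=llama | grep -A 6 "Requests:"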

Prepare the scripts

On the host machine, run the following command to copy the fine-tuning scripts to the persistent directory. In this tutorial, the script directory is "/workspace/llama_sft".

kubectl cp [/path/to/llama_sft] llama/llama-deploy-1node-1gpu-6d77656b9f-bxfbc:/workspace/llama_sft
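
To confirm the copy succeeded, you can list the directory inside the Pod (an optional check; replace the Pod name with your own):

kubectl exec -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- ls /workspace/llama_sft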

Enter the working directory in the Pod

Note

Replace the Pod name with the actual name of the Pod you created.

kubectl exec -it -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- bash

cd llama_sft

ls -l


Download the model

Download the model to the persistent directory so that it does not need to be downloaded again later. In this tutorial, the model directory is "/workspace/Meta-Llama-3-8B-Instruct".

pip install modelscope
modelscope download --model LLM-Research/Meta-Llama-3-8B-Instruct --local_dir /workspace/Meta-Llama-3-8B-Instruct
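
To confirm the download completed, list the model directory (an optional check):

ls -lh /workspace/Meta-Llama-3-8B-Instruct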

Run the single-node single-GPU fine-tuning script

bash llama_sft_1node_1gpu.sh

[Screenshot: training starts]

[Screenshot: training in progress]

Monitor GPU utilization

In a separate terminal, enter the Pod and run the following commands:

kubectl exec -it -n llama pod/llama-deploy-1node-1gpu-6d77656b9f-bxfbc -- bash
watch -n 1 nvidia-smi


Single-Node, Multi-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-1node-2gpu.yaml
kubectl get all -n llama

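Optionally, confirm that the Pod sees two GPUs before launching (not part of the original steps; replace the Pod name with your own):

# List the GPUs visible inside the Pod
kubectl exec -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- nvidia-smi -L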

Enter the working directory in the Pod

Note

Replace the Pod name with the actual name of the Pod you created.

# Pod-1
kubectl exec -it -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- bash

cd llama_sft

ls -l


Run the single-node multi-GPU fine-tuning script

bash llama_sft_1node_2gpu.sh

[Screenshot: training starts]

[Screenshot: training in progress]

Monitor GPU utilization

In another terminal, enter the Pod and run the following commands:

kubectl exec -it -n llama pod/llama-deploy-1node-2gpu-6d75f5457c-pn6sh -- bash
watch -n 1 nvidia-smi


Multi-Node, Multi-GPU Fine-Tuning

Create the Deployment

kubectl create -f deployment-2node-2gpu.yaml
kubectl get all -n llama -o wide


On the primary node, enter the working directory

Select one Pod as your primary node and enter the working directory to perform the configuration. In this example, the Pod with IP "172.29.203.147" is used as the primary node.
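
If you only need each Pod's IP, a custom-columns query is a convenient alternative to "-o wide" (an optional shortcut, not part of the original steps):

# Print each Pod's name and IP in the llama namespace
kubectl get pods -n llama -o custom-columns=NAME:.metadata.name,IP:.status.podIP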

Note

Replace the Pod name with the actual name of the Pod you started.

# Pod-1
kubectl exec -it -n llama pod/llama-deploy-2node-2gpu-6d75f5457c-8z8mq -- bash

cd llama_sft

ls -l


Modify the multi-node multi-GPU fine-tuning scripts

Modify the "master_addr" parameter in the "llama_sft_2node_2gpu_ds.sh" and "llama_sft_2node_2gpu_ds2.sh" scripts so that it points to the primary Pod's IP address, as shown below:

File "llama_sft_2node_2gpu_ds.sh"

image-20241220180110966

File "llama_sft_2node_2gpu_ds2.sh"

image-20241220180431568
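
The tutorial's screenshots show the exact lines to edit; as a rough, hedged sketch of what such a line typically looks like (assuming the scripts use a standard torchrun launch, which the tutorial does not confirm):

# Illustrative sketch only; the real scripts are in the tutorial source.
# master_addr must be the IP of the primary Pod (172.29.203.147 in this example).
# node_rank is 0 on the primary node (llama_sft_2node_2gpu_ds.sh) and 1 on the
# secondary node (llama_sft_2node_2gpu_ds2.sh).
# The --deepspeed flag assumes llama_sft.py parses Hugging Face TrainingArguments.
master_addr=172.29.203.147

torchrun \
  --nnodes=2 \
  --nproc_per_node=2 \
  --node_rank=0 \
  --master_addr=${master_addr} \
  --master_port=29500 \
  llama_sft.py --deepspeed ds_config.json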

Run the multi-node multi-GPU fine-tuning script on the primary node

bash llama_sft_2node_2gpu_ds.sh


Run the multi-node multi-GPU fine-tuning script on the secondary node

# Open a new terminal
kubectl exec -it -n llama pod/llama-deploy-85678bfb74-sbdxc -- bash
cd llama_sft
bash llama_sft_2node_2gpu_ds2.sh


[Screenshot: training in progress]

Monitor GPU utilization

In two additional terminals, enter the primary and secondary Pods respectively and run the following commands to monitor GPU utilization.

Primary node:

kubectl exec -it -n llama pod/llama-deploy-2node-2gpu-6d75f5457c-8z8mq -- bash
watch -n 1 nvidia-smi


Secondary node:

kubectl exec -it -n llama pod/llama-deploy-85678bfb74-sbdxc -- bash
watch -n 1 nvidia-smi
