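Deploying a PyTorch Instance on an Elastic Container Cluster

PyTorch is an open-source machine learning library that is widely used in both academia and industry, particularly in natural language processing (NLP), computer vision (CV), and reinforcement learning. It is often used together with Jupyter Notebook.

In this simple example, a PyTorch instance with Jupyter is deployed to an Elastic Container Cluster and exposed to the public internet using Kubernetes.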

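Prerequisites

This tutorial assumes that you have:

  • kubectl installed on your system
  • An activated Alaya NeW Elastic Container Cluster; for the steps, see: Activating an Elastic Container Cluster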

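Tutorial source code

First, download the source files required for this tutorial.

File list

This tutorial includes the following files. The table describes what each file does.

File name | Description
Dockerfile | Image build file: used to build the Docker image
config_harbor_secret.json | Configures the sensitive information for the Harbor container image registry
jupyter_harbor_secret.yaml | Defines the Secret resource: used to pull the custom image when the Deployment is created
jupyter_deploy.yaml | Defines the Deployment resource: specifies how Pods are started and stopped
jupyter_svc.yaml | Defines the Service resource: handles networking and publishes the service
jupyter_serviceexport.yaml | Defines the ServiceExporter resource: publishes the service to the public internet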

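Deployment manifest files

Dockerfile

Build a custom image on top of the PyTorch base image:

  1. Install additional Python packages, for example jupyterlab, pandas, and matplotlib;
  2. Set the working directory to /workspace;
  3. Set JUPYTER_DATA_DIR;
  4. Expose port 8888;
  5. Start JupyterLab.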

[Figure: contents of Dockerfile]

Harbor configuration

In this example, the basic information for the Harbor image registry is configured in config_harbor_secret.json.

[Figure: contents of config_harbor_secret.json]

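Note: When writing config_harbor_secret.json, replace the following values with your own:

Variable | Description | Source | Example
your_harbor_server | Access domain of the image registry | Resource Center / Storage Management / Image Registry page | https://registry.hd-01.alayanew.com:8443
username | Image registry username | Activation SMS | user
password | Image registry password | Activation SMS | password
email | Your email address | | abc@hello.com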
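The actual file is shown only as an image, so its exact layout is not recoverable here. As a minimal sketch, assuming the file follows the standard Docker `config.json` layout with an `auths` map (the layout that Kubernetes image pull secrets of type `kubernetes.io/dockerconfigjson` expect), the four values from the table could be assembled like this. The helper name `build_harbor_config` is hypothetical.

```python
import base64
import json


def build_harbor_config(server: str, username: str, password: str, email: str) -> dict:
    """Return a Docker-style registry config for one registry (assumed layout)."""
    # In Docker's config.json, "auth" is base64("username:password").
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {
        "auths": {
            server: {
                "username": username,
                "password": password,
                "email": email,
                "auth": auth,
            }
        }
    }


# Example values taken from the table above; replace them with your own.
config = build_harbor_config(
    server="https://registry.hd-01.alayanew.com:8443",
    username="user",
    password="password",
    email="abc@hello.com",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON is what would be saved as config_harbor_secret.json under this assumption. Compare it against the file shipped with the tutorial source before relying on it.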

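Secret

In this example, the credentials are specified in jupyter_harbor_secret.yaml.

A Secret is a Kubernetes resource object for storing and managing sensitive information (such as passwords, API keys, and certificates) so that this information is not exposed in configuration files or the environment. In this example, a Secret is created to protect the Harbor password.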

[Figure: contents of jupyter_harbor_secret.yaml]

Note: When writing jupyter_harbor_secret.yaml, replace the following value:

Variable | Description | Source | Example
.dockerconfigjson | config_harbor_secret.json encoded with base64 | Encoded manually | 0ssdxkcjuielsdjf....
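The manual base64 encoding mentioned in the table can be done with any tool. The following is a minimal Python sketch, not the tutorial's own method. The inline `config_text` is a hypothetical stand-in for the contents of config_harbor_secret.json, and `encode_dockerconfigjson` is an illustrative helper name.

```python
import base64
import json

# Hypothetical stand-in for the contents of config_harbor_secret.json.
config_text = json.dumps(
    {"auths": {"your_harbor_server": {"username": "user", "password": "password"}}}
)


def encode_dockerconfigjson(raw: bytes) -> str:
    """Base64-encode the raw config bytes as one line, with no line wrapping."""
    return base64.b64encode(raw).decode("ascii")


encoded = encode_dockerconfigjson(config_text.encode("utf-8"))
print(encoded)

# Sanity check before pasting: decoding must give back the original JSON.
assert json.loads(base64.b64decode(encoded)) == json.loads(config_text)
```

With the real file, the input bytes would come from `pathlib.Path("config_harbor_secret.json").read_bytes()`. The resulting string is the value to paste into the `.dockerconfigjson` field.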

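Deployment

In this example, the deployment is specified in jupyter_deploy.yaml.

Specifically, it instructs the Kubernetes control plane of the Elastic Container Cluster to do the following:

  1. Ensure that only one Pod is running at any time. This is defined by the spec.replicas key in the manifest.
  2. Reserve GPU, CPU, and memory resources on the Elastic Container Cluster compute node that runs the Pod. Each Jupyter instance running in a Kubernetes Pod is allocated 1 GPU, as defined by the spec.template.spec.containers.resources.limits.nvidia.com/gpu-h800 key.
  3. Use the specified image, defined by the spec.template.spec.containers.image key.
  4. Mount the PVC at the specified directory, defined by the spec.template.spec.containers.volumeMounts key.
  5. Use the specified PVC, defined by spec.template.spec.volumes.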

[Figure: contents of jupyter_deploy.yaml]

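Note: When writing jupyter_deploy.yaml, replace the following values:

Variable | Description | Source | Example
image | Image name | Your custom image | registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel-ssh-1.0
resources.requests.[GPU] | GPU resource information | Elastic Container Cluster / Cluster Details / Compute Configuration | nvidia.com/gpu-h800
volumes.persistentVolumeClaim.claimName | PVC name | The PVC created by default; see Declaring Storage | pvc-capacity-userdata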
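To make the key paths listed above concrete, here is a sketch of a Deployment with that shape, written as a Python dict that mirrors the manifest structure. It is not the content of the tutorial's jupyter_deploy.yaml, which is only available as an image. The `app: jupyter` label, the container name, the volume name `workspace`, the `/workspace` mount path, and the bracketed part of the image name are assumptions. The resource names `jupyter-deploy`, `harbor-secret`, and `pvc-capacity-userdata` come from the command output and table in this tutorial.

```python
import json

# Placeholder image name; the bracketed part must be replaced with your project.
IMAGE = (
    "registry.hd-01.alayanew.com:8443/[your-project]"
    "/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel"
)

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "jupyter-deploy", "namespace": "jupyter"},
    "spec": {
        "replicas": 1,  # item 1: exactly one Pod
        "selector": {"matchLabels": {"app": "jupyter"}},  # assumed label
        "template": {
            "metadata": {"labels": {"app": "jupyter"}},
            "spec": {
                # Secret used to pull the custom image from Harbor.
                "imagePullSecrets": [{"name": "harbor-secret"}],
                "containers": [
                    {
                        "name": "jupyter",  # hypothetical container name
                        "image": IMAGE,  # item 3
                        "ports": [{"containerPort": 8888}],
                        # item 2: one GPU of the cluster's GPU resource type
                        "resources": {"limits": {"nvidia.com/gpu-h800": 1}},
                        # item 4: where the PVC is mounted (assumed path)
                        "volumeMounts": [
                            {"name": "workspace", "mountPath": "/workspace"}
                        ],
                    }
                ],
                # item 5: the PVC backing the mount
                "volumes": [
                    {
                        "name": "workspace",
                        "persistentVolumeClaim": {
                            "claimName": "pvc-capacity-userdata"
                        },
                    }
                ],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

kubectl accepts manifests in JSON as well as YAML, so the printed output has the same structure a YAML manifest would have.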

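Service

In this example, the service is specified in jupyter_svc.yaml.

In Kubernetes, network traffic is handled by Services, which are usually defined as resources of their own. This example defines a Service that fronts the deployment on TCP port 8888. The Service is of type ClusterIP, which is reachable only from inside the cluster; publishing it to the public internet is handled by the ServiceExporter described in the next section: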

[Figure: contents of jupyter_svc.yaml]
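A Service of this kind might look like the following sketch, again written as a Python dict that mirrors the manifest structure. The name `jupyter-svc`, the type, and the port match the `kubectl get svc` output later in this tutorial. The `app: jupyter` selector is an assumption and must match whatever labels the Deployment's Pod template actually uses.

```python
import json

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "jupyter-svc", "namespace": "jupyter"},
    "spec": {
        "type": "ClusterIP",  # cluster-internal virtual IP only
        "selector": {"app": "jupyter"},  # assumed Pod label
        "ports": [
            # Forward Service port 8888 to the container's port 8888.
            {"protocol": "TCP", "port": 8888, "targetPort": 8888}
        ],
    },
}

print(json.dumps(service, indent=2))
```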

Publishing the service

In this example, publishing the service to the public internet is specified in jupyter_serviceexport.yaml.

[Figure: contents of jupyter_serviceexport.yaml]
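The following sketch shows what such a ServiceExporter might contain, written as a Python dict. ServiceExporter is a platform-specific custom resource, so the field names `serviceName` and `servicePort` are an assumption, inferred from the "Service Name" and "Service Port" entries in the `kubectl describe serviceexporter` output later in this tutorial. The API version, kind, and resource name are taken from that same output.

```python
import json

service_exporter = {
    "apiVersion": "osm.datacanvas.com/v1alpha1",
    "kind": "ServiceExporter",
    "metadata": {"name": "jupyter-se-svc", "namespace": "jupyter"},
    "spec": {
        # Assumed field names; they point at the Service to publish.
        "serviceName": "jupyter-svc",
        "servicePort": 8888,
    },
}

print(json.dumps(service_exporter, indent=2))
```

Treat the jupyter_serviceexport.yaml file from the tutorial source as authoritative if its field names differ.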

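Procedure

Preparing the image

Note: In the following commands, replace the account, password, image name, image registry address, and other details with your own.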

# pull image
docker pull pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

# build image
docker build -t pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel -f [/path/to/Dockerfile] .

# login
docker login https://registry.hd-01.alayanew.com:8443/ -u [user] -p [passwd]

# tag
docker tag pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel [registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe]/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel

# push
docker push [registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe]/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel

Deploying the Kubernetes resources

# Declare the Elastic Container Cluster configuration
export KUBECONFIG="[/path/to/kubeconfig]"

# Create the namespace
kubectl create namespace jupyter
namespace/jupyter created

# Create the secret
kubectl apply -f jupyter_harbor_secret.yaml
secret/harbor-secret created

# Create the deployment
kubectl apply -f jupyter_deploy.yaml
deployment.apps/jupyter-deploy created

# Create the service
kubectl apply -f jupyter_svc.yaml
service/jupyter-svc created

# Create the ServiceExporter
kubectl apply -f jupyter_serviceexport.yaml
serviceexporter.osm.datacanvas.com/jupyter-svc created

Viewing the resources

List all Pods

# kubectl get pods -n jupyter
NAME READY STATUS RESTARTS AGE
jupyter-deploy-576b8fb97d-hnmhq 1/1 Running 0 5m24s

View the Deployment

# kubectl get deploy -n jupyter
NAME READY UP-TO-DATE AVAILABLE AGE
jupyter-deploy 1/1 1 1 5m55s

View the Service

# kubectl get svc -n jupyter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
jupyter-svc ClusterIP 10.68.199.68 <none> 8888/TCP 5m40s

View the Pod details

kubectl describe pod jupyter-deploy-576b8fb97d-hnmhq -n jupyter
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m37s default-scheduler Successfully assigned jupyter/jupyter-deploy-576b8fb97d-hnmhq to k8s-mas-gpu-8-78
Normal AddedInterface 6m35s multus Add eth0 [172.19.129.41/32] from k8s-pod-network
Normal Pulling 6m34s kubelet Pulling image "registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel"
Normal Pulled 4m37s kubelet Successfully pulled image "registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel" in 1m57.624319824s (1m57.624329826s including waiting)
Normal Created 4m37s kubelet Created container sd-cuda-container
Normal Started 4m37s kubelet Started container sd-cuda-container

Getting the Jupyter login token

View the Pod's logs to get the Jupyter login token. The token appears as part of the URL of the served instance. For example:

# kubectl logs  jupyter-deploy-576b8fb97d-hnmhq -n jupyter
.......
[I 2024-12-10 11:57:27.353 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-12-10 11:57:27.355 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-12-10 11:57:27.358 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-12-10 11:57:27.358 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2024-12-10 11:57:27.508 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-12-10 11:57:27.517 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-12-10 11:57:27.518 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-12-10 11:57:27.519 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-12-10 11:57:27.520 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.11/site-packages/jupyterlab
[I 2024-12-10 11:57:27.520 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 2024-12-10 11:57:27.520 LabApp] Extension Manager is 'pypi'.
[I 2024-12-10 11:57:27.549 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-12-10 11:57:27.549 ServerApp] Serving notebooks from local directory: /workspace
[I 2024-12-10 11:57:27.549 ServerApp] Jupyter Server 2.14.2 is running at:
[I 2024-12-10 11:57:27.549 ServerApp] http://jupyter-deploy-576b8fb97d-hnmhq:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
[I 2024-12-10 11:57:27.549 ServerApp] http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
[I 2024-12-10 11:57:27.549 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-12-10 11:57:27.551 ServerApp]

To access the server, open this file in a browser:
file:///root/.local/share/jupyter/runtime/jpserver-1-open.html
Or copy and paste one of these URLs:
http://jupyter-deploy-576b8fb97d-hnmhq:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
.....

In this example, the token is: "a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7"
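Picking the token out of a long log by eye is error-prone. As a small sketch, a regular expression can extract it from the log text. The helper name `extract_token` is hypothetical, and the two log lines are copied from the sample output above.

```python
import re
from typing import Optional

# Two lines copied from the sample Pod log above.
log_text = """\
[I 2024-12-10 11:57:27.549 ServerApp] http://jupyter-deploy-576b8fb97d-hnmhq:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
[I 2024-12-10 11:57:27.549 ServerApp] http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
"""


def extract_token(logs: str) -> Optional[str]:
    """Return the first hex value of a token= query parameter, or None."""
    match = re.search(r"[?&]token=([0-9a-f]+)", logs)
    return match.group(1) if match else None


token = extract_token(log_text)
print(token)  # a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
```

In practice the input would be the saved output of the `kubectl logs` command shown above.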

Getting the public URL

# kubectl get serviceexporter -n jupyter
NAME AGE
jupyter-se-svc 9m3s

# kubectl describe serviceexporter jupyter-se-svc -n jupyter
Name: jupyter-se-svc
Namespace: jupyter
Labels: <none>
Annotations: <none>
API Version: osm.datacanvas.com/v1alpha1
Kind: ServiceExporter
Metadata:
Creation Timestamp: 2024-12-11T06:02:39Z
Generation: 1
Resource Version: 20756
UID: c60c7e52-0703-40c0-80f4-205a2886a522
Spec:
Service Name: jupyter-svc
Service Port: 8888
Status:
Conditions:
Last Transition Time: 2024-12-11T06:02:39Z
Message: IngressRoute successfully updated, url: https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com
Reason: IngressRouteUpdated
Status: True
Type: Ready
Events: <none>


In this example, the public URL is: https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com

Services published through ServiceExporter use port 22443 by default.

So the final URL is: https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com:22443
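Assembling the final address can be sketched as follows: take the URL reported by the ServiceExporter, add port 22443, and optionally append the login token as a query parameter. The `/lab?token=` form mirrors the URLs printed in the Pod log. Whether the public proxy passes that path through unchanged is an assumption, and `build_login_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_login_url(exported_url: str, token: str, port: int = 22443) -> str:
    """Add the ServiceExporter port and a token query to the exported URL."""
    parts = urlsplit(exported_url)
    netloc = f"{parts.hostname}:{port}"
    query = urlencode({"token": token})
    return urlunsplit((parts.scheme, netloc, "/lab", query, ""))


login_url = build_login_url(
    "https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com",
    "a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7",
)
print(login_url)
```

If the token query is not accepted, open the URL without it and enter the token on the Jupyter login page instead.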

Logging in to Jupyter

Open a browser, navigate to https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com:22443, and log in with the token obtained from the Pod's logs.

[Figure: Jupyter opened in the browser]

Congratulations! You have deployed a PyTorch instance running Jupyter Notebook on an Elastic Container Cluster!