
Deploying a PyTorch Instance on an Elastic Container Cluster

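PyTorch is an open-source machine learning library that is widely used in both academia and industry, particularly in natural language processing (NLP), computer vision (CV), and reinforcement learning. It is often used together with Jupyter Notebook.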

In this simple example, a PyTorch instance with Jupyter is deployed to an Elastic Container Cluster and exposed to the public internet through Kubernetes.

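Prerequisites

This tutorial assumes that you already have the following:

Tutorial source code

First, download the source files required for this tutorial.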

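Manifest

This tutorial includes the following files. The table below describes what each one does.

File name | Purpose
Dockerfile | Builds the Docker image.
config_harbor_secret.json | Holds the sensitive settings for the Harbor container image registry.
jupyter_harbor_secret.yaml | Defines the Secret resource used to pull the custom image when the Deployment is created.
jupyter_deploy.yaml | Defines the Deployment resource, which controls how Pods are started and stopped.
jupyter_svc.yaml | Defines the Service resource, which handles networking and publishes the service.
jupyter_serviceexport.yaml | Defines the ServiceExporter resource, which publishes the service to the public internet.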

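Manifest details

  • Dockerfile: builds a custom image on top of the PyTorch base image.
  • config_harbor_secret.json: in this example, this file holds the basic information for the Harbor image registry. Replace the following values with your own.
Variable | Description | Source | Example
your_harbor_server | Registry access domain | Resource Center / Storage Management / Image Registry page | https://registry.hd-01.alayanew.com:8443
username | Registry user name | Activation SMS | user
password | Registry password | Activation SMS | password
email | Your email address | | abc@hello.com
  • jupyter_harbor_secret.yaml: this file defines the Secret that stores and manages sensitive information (such as passwords, API keys, and certificates). Replace the following value with your own.
Variable | Description | Source | Example
.dockerconfigjson | The base64-encoded content of config_harbor_secret.json | Encoded manually | 0ssdxkcjuielsdjf....
  • jupyter_deploy.yaml: specifies the deployment. Replace the following values with your own.
Variable | Description | Source | Example
image | Image name | Custom image | registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel-ssh-1.0
resources.requests.[GPU] | GPU resource information | Elastic Container Cluster / Cluster Details / Compute Configuration | nvidia.com/gpu-h800
volumes.persistentVolumeClaim.claimName | PVC name | The PVC created by default; see Declaring Storage | pvc-capacity-userdata
  • jupyter_svc.yaml: specifies the service. This example defines a Service of type ClusterIP that handles network traffic inside Kubernetes and opens TCP port 8888. A ClusterIP Service is reachable only from inside the cluster; the ServiceExporter resource is what publishes it to the public internet.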
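The manual base64 step for .dockerconfigjson can be scripted. The following is a minimal sketch in Python, assuming that config_harbor_secret.json follows the standard Docker config layout (an "auths" object keyed by registry address); the tutorial does not show the file itself, so treat the field layout and the helper name build_dockerconfigjson as assumptions rather than the exact content of the shipped file.

```python
import base64
import json


def build_dockerconfigjson(server: str, username: str, password: str, email: str) -> str:
    """Return a base64 string suitable for the .dockerconfigjson field of a Secret.

    Assumption: the registry credentials use the standard Docker config layout.
    """
    # "auth" is base64("username:password"), as in a regular Docker config file.
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            server: {
                "username": username,
                "password": password,
                "email": email,
                "auth": auth,
            }
        }
    }
    # Encode the whole JSON document; the result is a single line with no newlines.
    raw = json.dumps(config).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder values taken from the example column of the table above.
encoded = build_dockerconfigjson(
    "https://registry.hd-01.alayanew.com:8443", "user", "password", "abc@hello.com"
)
print(encoded)
```

If you already have a filled-in config_harbor_secret.json, encoding the file's bytes directly with base64.b64encode gives the same kind of value; the helper above is only useful when you want to generate the JSON and the encoded string in one step.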


Publishing the service

In this example, publishing the service to the public internet is defined in jupyter_serviceexport.yaml.

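Procedure

Preparing the image

Note: In the commands below, replace the account, password, image name, registry address, and similar values with your own.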

# pull image
docker pull pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

# build image
docker build -t pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel -f [/path/to/Dockerfile] .

# login
docker login https://registry.hd-01.alayanew.com:8443/ -u [user] -p [passwd]

# tag
docker tag pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel [registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe]/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel

# push
docker push [registry.hd-01.alayanew.com:8443/alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe]/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel
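The tag and push commands above both need the same fully qualified image name. As a small illustration, the sketch below shows how that name is composed; the function image_reference and the term "project" for the first path segment are assumptions made for this example, not names defined by the tutorial.

```python
def image_reference(registry: str, project: str, repository: str, tag: str) -> str:
    """Join the parts of an image name as host[:port]/project/repository:tag."""
    # docker tag and docker push expect a bare host[:port], so drop any
    # URL scheme and trailing slash from the registry address.
    host = registry.split("://", 1)[-1].rstrip("/")
    return f"{host}/{project}/{repository}:{tag}"


# Placeholder values copied from the commands above; replace them with your own.
target = image_reference(
    "https://registry.hd-01.alayanew.com:8443/",
    "alayanew-dab57f9b-35f5-4dc1-afff-5cfd02esdsfe",
    "pytorch/pytorch",
    "2.5.1-cuda12.4-cudnn9-jupyter-devel",
)
print(target)
```

The same string is what the image field in jupyter_deploy.yaml should point to, so that the Deployment pulls the image you pushed.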

Deploying the Kubernetes resources

# Point kubectl at the Elastic Container Cluster configuration
export KUBECONFIG="[/path/to/kubeconfig]"

# Create the namespace
kubectl create namespace jupyter
namespace/jupyter created

# Create the Secret
kubectl apply -f jupyter_harbor_secret.yaml
secret/harbor-secret created

# Create the Deployment
kubectl apply -f jupyter_deploy.yaml
deployment.apps/jupyter-deploy created

# Create the Service
kubectl apply -f jupyter_svc.yaml
service/jupyter-svc created

# Create the ServiceExporter
kubectl apply -f jupyter_serviceexport.yaml
serviceexporter.osm.datacanvas.com/jupyter-svc created

Checking the resources

List all Pods

# kubectl get pods -n jupyter
NAME READY STATUS RESTARTS AGE
jupyter-deploy-576b8fb97d-hnmhq 1/1 Running 0 5m24s
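If you want to check readiness from a script rather than by eye, the table printed by kubectl get pods can be parsed. This is a rough sketch that assumes the default column order shown above (NAME, READY, STATUS, ...); the helper name not_ready_pods is made up for this example.

```python
from typing import List


def not_ready_pods(get_pods_output: str) -> List[str]:
    """Return the names of Pods that are not Running with all containers ready."""
    problems = []
    lines = [line for line in get_pods_output.strip().splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Sample text copied from the output above.
sample = """NAME READY STATUS RESTARTS AGE
jupyter-deploy-576b8fb97d-hnmhq 1/1 Running 0 5m24s
"""
print(not_ready_pods(sample))  # prints [] because the only Pod is ready
```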

Check the Deployment

# kubectl get deploy -n jupyter
NAME READY UP-TO-DATE AVAILABLE AGE
jupyter-deploy 1/1 1 1 5m55s

Check the Service

# kubectl get svc -n jupyter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
jupyter-svc ClusterIP 10.68.199.68 <none> 8888/TCP 5m40s

Inspect the Pod details

kubectl describe pod jupyter-deploy-576b8fb97d-hnmhq -n jupyter
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m37s default-scheduler Successfully assigned jupyter/jupyter-deploy-576b8fb97d-hnmhq to k8s-mas-gpu-8-78
Normal AddedInterface 6m35s multus Add eth0 [172.19.129.41/32] from k8s-pod-network
Normal Pulling 6m34s kubelet Pulling image "registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel"
Normal Pulled 4m37s kubelet Successfully pulled image "registry.hd-01.alayanew.com:8443/vc-huangxs/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-jupyter-devel" in 1m57.624319824s (1m57.624329826s including waiting)
Normal Created 4m37s kubelet Created container sd-cuda-container
Normal Started 4m37s kubelet Started container sd-cuda-container

Getting the Jupyter login token

Check the Pod logs to get the Jupyter login token. The token appears as part of the URL of the served instance. For example:

# kubectl logs  jupyter-deploy-576b8fb97d-hnmhq -n jupyter
.......
[I 2024-12-10 11:57:27.353 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-12-10 11:57:27.355 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-12-10 11:57:27.358 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-12-10 11:57:27.358 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2024-12-10 11:57:27.508 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-12-10 11:57:27.517 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-12-10 11:57:27.518 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-12-10 11:57:27.519 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-12-10 11:57:27.520 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.11/site-packages/jupyterlab
[I 2024-12-10 11:57:27.520 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 2024-12-10 11:57:27.520 LabApp] Extension Manager is 'pypi'.
[I 2024-12-10 11:57:27.549 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-12-10 11:57:27.549 ServerApp] Serving notebooks from local directory: /workspace
[I 2024-12-10 11:57:27.549 ServerApp] Jupyter Server 2.14.2 is running at:
[I 2024-12-10 11:57:27.549 ServerApp] http://jupyter-deploy-576b8fb97d-hnmhq:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
[I 2024-12-10 11:57:27.549 ServerApp] http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
[I 2024-12-10 11:57:27.549 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-12-10 11:57:27.551 ServerApp]

To access the server, open this file in a browser:
file:///root/.local/share/jupyter/runtime/jpserver-1-open.html
Or copy and paste one of these URLs:
http://jupyter-deploy-576b8fb97d-hnmhq:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7
.....

In this example, the token is "a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7".
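Picking the token out of a long log by hand is error-prone, so a short script can help. The sketch below assumes the token is a hexadecimal string following "token=" in a URL, as in the log above; the name extract_token is hypothetical.

```python
import re
from typing import Optional

# Assumption: the token is a run of lowercase hex digits after "token=".
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-f]+)")


def extract_token(log_text: str) -> Optional[str]:
    """Return the first Jupyter token found in the log text, or None."""
    match = TOKEN_PATTERN.search(log_text)
    return match.group(1) if match else None


# One line copied from the log output above.
sample_log = (
    "[I 2024-12-10 11:57:27.549 ServerApp] "
    "http://127.0.0.1:8888/lab?token=a346695eeadcc8ee7d8a03ea1a691af9a8d7aa3ff889f1d7"
)
print(extract_token(sample_log))
```

In practice you would feed it the text returned by kubectl logs for the Pod instead of the sample line.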

Getting the public URL

# kubectl get serviceexporter -n jupyter
NAME AGE
jupyter-se-svc 9m3s

# kubectl describe serviceexporter jupyter-se-svc -n jupyter
Name: jupyter-se-svc
Namespace: jupyter
Labels: <none>
Annotations: <none>
API Version: osm.datacanvas.com/v1alpha1
Kind: ServiceExporter
Metadata:
Creation Timestamp: 2024-12-11T06:02:39Z
Generation: 1
Resource Version: 20756
UID: c60c7e52-0703-40c0-80f4-205a2886a522
Spec:
Service Name: jupyter-svc
Service Port: 8888
Status:
Conditions:
Last Transition Time: 2024-12-11T06:02:39Z
Message: IngressRoute successfully updated, url: https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com
Reason: IngressRouteUpdated
Status: True
Type: Ready
Events: <none>


In this example, the public URL is https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com

Services published through ServiceExporter use port 22443 by default.

So the final URL is https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com:22443
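The two steps above (read the URL from the ServiceExporter status message, then append the default port) can be expressed as a short sketch. It assumes the message keeps the "url: ..." form shown in the kubectl describe output; the function name public_url is made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Default port for services published through ServiceExporter, per the text above.
SERVICE_EXPORTER_PORT = 22443


def public_url(status_message: str, port: int = SERVICE_EXPORTER_PORT) -> str:
    """Extract the URL from a ServiceExporter status message and add the port."""
    match = re.search(r"url:\s*(\S+)", status_message)
    if match is None:
        raise ValueError("no url found in ServiceExporter status message")
    parts = urlsplit(match.group(1))
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Message copied from the describe output above.
message = (
    "IngressRoute successfully updated, url: "
    "https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com"
)
print(public_url(message))
```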

Logging in to Jupyter

Open a browser, go to https://jupyter-svc-x-jupyter-x-vc2qofwoe524.sproxy.hd-01.alayanew.com:22443, and sign in with the login token obtained from the Pod logs.


Congratulations! You have deployed a PyTorch instance running Jupyter Notebook on an Elastic Container Cluster.
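As an optional first notebook cell, you may want to confirm that PyTorch can see the GPU requested in jupyter_deploy.yaml. The following is a minimal sketch, not part of the tutorial's source files; the function name describe_torch_environment is hypothetical, and the output depends on your image and GPU allocation.

```python
import importlib.util


def describe_torch_environment() -> dict:
    """Summarize the PyTorch installation visible to the current kernel."""
    # Avoid an ImportError if this runs in a kernel without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        # Name of the first visible GPU.
        info["first_gpu"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_environment())
```

If cuda_available is False inside the Pod, the GPU resource name in resources.requests is a reasonable first thing to check.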