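Prerequisites
- A Kubernetes cluster with GPU or NPU nodes has been provisioned.
- The Intelligent Computing Suite is installed.
Background
This topic shows how to submit a distributed PyTorch training job. The data the job needs is already included in the container image. To run your own model or training job, download the dataset yourself, store it on CSI hpfs file storage, and mount it into the container through a PVC.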
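The PVC approach mentioned above could look roughly like the following sketch. It is not taken from the product documentation: the claim name `training-data`, the `/data` mount path, the requested size, the `ReadWriteMany` access mode, and the storage class placeholder are all assumptions for illustration. Check the storage classes actually available in your cluster before using it.

```yaml
# Hypothetical PVC backed by hpfs file storage.
# storageClassName is a placeholder; use the class name defined in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data            # hypothetical name
  namespace: kubeflow            # must match the namespace of the PyTorchJob
spec:
  accessModes:
    - ReadWriteMany              # assumed, so Master and Worker can share the data
  storageClassName: <your-hpfs-storage-class>
  resources:
    requests:
      storage: 100Gi             # illustrative size
---
# Fragment to merge into each replica's pod template in the PyTorchJob.
template:
  spec:
    volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: training-data
    containers:
      - name: pytorch
        volumeMounts:
          - name: dataset
            mountPath: /data     # hypothetical path read by your training script
```

The same volume and volumeMounts entries would be added to both the Master and the Worker replica so that every pod sees the dataset at the same path.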
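Procedure
- Log in to the Cloud Container Engine console.
- In the left navigation pane, click [Clusters] to open the cluster list.
- Click the name of the cluster you want to use to open it.
- In the left navigation pane, click [Workloads] -> [Custom Resources], select the resource browser, find kubeflow.org/v1/PyTorchJob, select a namespace, and click Add.
- In the YAML editor, enter the following content and click [Create].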
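Note:
1. GPUs and Ascend NPUs are requested through different resource types, so use the template that matches your hardware.
2. Change the image registry address prefix to the one for your resource pool, which you can look up in the container image console. For example, for Wuhan 41, replace {image_repo} with registry-vpc-crs-wuhan41.ctyun.cn.
3. The namespace field must match the namespace selected in the console.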
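As an illustration of notes 2 and 3, the fragment below shows how the affected fields might look after substitution, assuming the Wuhan 41 resource pool from the example above and a job created in the kubeflow namespace. Only these fields are shown; the rest of the template is unchanged.

```yaml
# Fragment only: fields touched by notes 2 and 3.
metadata:
  namespace: kubeflow   # same namespace as the one selected in the console
spec:
  pytorchReplicaSpecs:
    Master:
      template:
        spec:
          containers:
            - name: pytorch
              # {image_repo} replaced with the Wuhan 41 registry prefix
              image: registry-vpc-crs-wuhan41.ctyun.cn/library/kubeflow-examples-pytorch-dist-mnist:multi
```

The Worker replica uses the same image value.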
GPU template
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-sample-gpu-01
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              # Replace {image_repo} with your registry prefix before creating the job.
              image: {image_repo}/library/kubeflow-examples-pytorch-dist-mnist:multi
              imagePullPolicy: IfNotPresent
              command:
                - "python3"
                - "/opt/mnist/src/mnist.py"
                - "--epochs=10"
                - "--backend=nccl"
              env:
                - name: PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION
                  value: python
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: {image_repo}/library/kubeflow-examples-pytorch-dist-mnist:multi
              imagePullPolicy: IfNotPresent
              command:
                - "python3"
                - "/opt/mnist/src/mnist.py"
                - "--epochs=10"
                - "--backend=nccl"
              env:
                - name: PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION
                  value: python
              resources:
                limits:
                  nvidia.com/gpu: 1
NPU template
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-sample-npu-01
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              # Replace {image_repo} with your registry prefix before creating the job.
              image: {image_repo}/library/kubeflow-examples-pytorch-dist-mnist:multi
              imagePullPolicy: IfNotPresent
              command:
                - "bash"
                - "-c"
              # Load the Ascend toolkit environment, then start training with the HCCL backend.
              args: ["source /usr/local/Ascend/ascend-toolkit/set_env.sh && python3 /opt/mnist/src/mnist.py --epochs=10 --backend=hccl"]
              env:
                - name: PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION
                  value: python
              resources:
                limits:
                  huawei.com/Ascend910: 1
                requests:
                  huawei.com/Ascend910: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: {image_repo}/library/kubeflow-examples-pytorch-dist-mnist:multi
              imagePullPolicy: IfNotPresent
              command:
                - "bash"
                - "-c"
              args: ["source /usr/local/Ascend/ascend-toolkit/set_env.sh && python3 /opt/mnist/src/mnist.py --epochs=10 --backend=hccl"]
              env:
                - name: PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION
                  value: python
              resources:
                limits:
                  huawei.com/Ascend910: 1
                requests:
                  huawei.com/Ascend910: 1
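- Check the run status: in the left navigation pane, click [Workloads] -> [Pods], find the pods whose names start with the job name, click a pod name, and check whether the logs, monitoring data, and other information are as expected.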