This topic describes how to use the Gang scheduling capability of Cloud Container Engine to solve the problem that the native Kubernetes scheduler cannot support All-or-Nothing job scheduling.
Prerequisites
The intelligent computing suite (智算套件) has been installed.
Background
Gang scheduling is a scheduling policy that guarantees a group of related tasks runs together: when the tasks of a job are scheduled, either all of them succeed or none of them does. A classic use case is distributed machine learning training: when training a large model, the data may be sharded across multiple nodes, and each node runs a replica of the model. These replicas must start training at the same time so that parameter updates stay synchronized. As large-scale, complex workloads become common on Kubernetes, a matching scheduling policy is needed to avoid wasted resources and scheduling delays. Because the core Kubernetes scheduler does not support Gang scheduling by default, some of these workloads cannot be migrated to Kubernetes smoothly. To cover this scenario, Cloud Container Engine implements Gang scheduling on top of the scheduler framework, so the capability can be used conveniently within the cluster.
Feature description
To implement the All-or-Nothing semantics, the group of Pods that must be scheduled together is first identified through annotations; this group is called a PodGroup. When a job is submitted, the scheduler reads the scheduling configuration from the workload's annotations and schedules accordingly. The Pods are scheduled as a unit only when the cluster has enough resources for the minimum number of running members; otherwise the job stays in the Pending state.
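Conceptually, the PodGroup object behind a job looks like the following sketch. The field names are taken from the scheduling.roc/v1beta1 PodGroup object shown later in this topic; the values here are illustrative only:

```yaml
# Illustrative PodGroup: the scheduler binds the member Pods only
# when at least minMember of them can run and the cluster can supply
# minResources in aggregate; otherwise all members stay Pending.
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  minMember: 2            # schedule nothing unless 2 Pods fit
  minResources:
    nvidia.com/gpu: "2"   # aggregate resources the whole group needs
```

In practice you normally do not create this object by hand; as shown below, the scheduling component derives it automatically from the workload.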
Usage
The following example uses a Kubeflow TFJob to demonstrate the Gang scheduling capability.
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: gang-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: roc   # use the intelligent computing scheduler
          containers:
            - name: tensorflow
              image: busybox:latest
              imagePullPolicy: IfNotPresent
              command: ["sleep", "30s"]
              resources:
                limits:
                  nvidia.com/gpu: 1
After the job is submitted to the cluster, you can see that the scheduling component automatically creates a PodGroup custom resource object for it:
[root@pm-b86b yaml]# kubectl get pg
NAME           STATUS    MINMEMBER   RUNNINGS   AGE
gang-example   Running   2                      21s
[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:32:54Z"
  generation: 5
  name: gang-example
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gang-example
    uid: c916fae7-bae1-448e-9e99-f02999f9d91c
  resourceVersion: "40581201"
  uid: c815b6a4-c258-4bb8-bcb6-526baaaa1dfa
spec:
  minMember: 2
  minResources:
    nvidia.com/gpu: "2"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:32:59Z"
    message: '2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 1c53b240-9b12-426e-a089-edf133a16b58
    type: Unschedulable
  - lastTransitionTime: "2024-04-14T03:33:19Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 2afbaf4b-5424-414c-b89a-a3416925b9b0
    type: Scheduled
  phase: Running
  running: 2
Key fields
- minMember: the minimum number of Pods or tasks that must run under this PodGroup. If the cluster resources cannot satisfy minMember tasks, the scheduler will not schedule any task in the PodGroup.
- queue: the queue that the PodGroup belongs to. The queue must already exist and be in the open state.
- priorityClassName: the priority of the PodGroup, used by the scheduler to order all PodGroups in the same queue. system-node-critical and system-cluster-critical are two reserved values that indicate the highest priority. If not specified, the default or zero priority is used.
- minResources: the minimum resources required to run the PodGroup. If the allocatable resources of the cluster cannot satisfy minResources, the scheduler will not schedule any task in the PodGroup.
- phase: the current state of the PodGroup.
- conditions: the detailed status records of the PodGroup, covering the key events in its lifecycle.
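The queue and priorityClassName fields do not appear in the auto-created object above, but a PodGroup that sets them explicitly might look like the following sketch. The queue name default-queue and the PriorityClass name high-priority are assumptions for illustration, not names defined by this topic:

```yaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  name: gang-example
spec:
  minMember: 2
  queue: default-queue               # assumed queue; must exist and be open
  priorityClassName: high-priority   # assumed PriorityClass name
  minResources:
    nvidia.com/gpu: "2"
```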
Check the running status
Because the cluster has enough resources for all Pods of the job, the following command shows that the Pods are already running.
[root@pm-b86b yaml]# kubectl get po | grep gang
gang-example-worker-0 1/1 Running 0 31s
gang-example-worker-1 1/1 Running 0 31s
If the cluster does not have enough resources to run all Pods, none of the Pods will be scheduled, and the scheduling status can be inspected through the PodGroup.
[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":10,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:47:09Z"
  generation: 4
  name: gang-example
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gang-example
    uid: 8caecc94-7220-4bbc-bde2-6c94fe478a35
  resourceVersion: "40583543"
  uid: 69034c3f-3c51-4159-b8db-8a965a3838f7
spec:
  minMember: 10
  minResources:
    nvidia.com/gpu: "10"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:47:40Z"
    message: '10/0 tasks in gang unschedulable: pod group is not ready, 10 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 3359ff1a-d558-4148-949f-f3f53f501a4c
    type: Unschedulable
  phase: Pending