This topic describes how to use the Gang scheduling capability of Cloud Container Engine to solve the problem that the native Kubernetes scheduler cannot support All-or-Nothing job scheduling.
Prerequisites
The intelligent computing suite (智算套件) has been installed.
Background
Gang scheduling is a scheduling policy that guarantees a group of related tasks runs together: when the tasks of a job are scheduled, either all of them succeed or none of them does. A classic use case is distributed machine learning training: when training a large model, the data may be sharded across multiple nodes, and each node runs a replica of the model. These replicas must start training at the same time so that parameter updates stay synchronized. As large-scale, complex workloads become common on Kubernetes, a matching scheduling policy is needed to avoid wasted resources and scheduling delays. Because the core Kubernetes scheduler does not support Gang scheduling by default, some of these workloads cannot be migrated to Kubernetes smoothly. To cover this scenario, Cloud Container Engine implements Gang scheduling on top of the scheduler framework, so the capability can be used conveniently within the cluster.
Feature description
To implement the All-or-Nothing semantics, the group of Pods that must be scheduled together is first identified through annotations; this group is called a PodGroup. When a job is submitted, the scheduler reads the scheduling configuration from the workload's annotations and schedules accordingly. The Pods are scheduled as a unit only when the cluster has enough resources for the minimum number of running members; otherwise the job stays in the Pending state.
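Conceptually, the PodGroup object behind a job looks like the following sketch. The field names are taken from the scheduling.roc/v1beta1 PodGroup object shown later in this topic; the values here are illustrative only:

```yaml
# Illustrative PodGroup: the scheduler binds the member Pods only
# when at least minMember of them can run and the cluster can supply
# minResources in aggregate; otherwise all members stay Pending.
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  minMember: 2            # schedule nothing unless 2 Pods fit
  minResources:
    nvidia.com/gpu: "2"   # aggregate resources the whole group needs
```

In practice you normally do not create this object by hand; as shown below, the scheduling component derives it automatically from the workload.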
Usage
The following example uses a Kubeflow TFJob to demonstrate the Gang scheduling capability.
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: gang-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: roc   # use the intelligent computing scheduler
          containers:
            - name: tensorflow
              image: busybox:latest
              imagePullPolicy: IfNotPresent
              command: ["sleep", "30s"]
              resources:
                limits:
                  nvidia.com/gpu: 1
After the job is submitted to the cluster, you can see that the scheduling component automatically creates a PodGroup custom resource object for it:
[root@pm-b86b yaml]# kubectl get pg
NAME           STATUS    MINMEMBER   RUNNINGS   AGE
gang-example   Running   2                      21s
[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:32:54Z"
  generation: 5
  name: gang-example
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gang-example
    uid: c916fae7-bae1-448e-9e99-f02999f9d91c
  resourceVersion: "40581201"
  uid: c815b6a4-c258-4bb8-bcb6-526baaaa1dfa
spec:
  minMember: 2
  minResources:
    nvidia.com/gpu: "2"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:32:59Z"
    message: '2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 1c53b240-9b12-426e-a089-edf133a16b58
    type: Unschedulable
  - lastTransitionTime: "2024-04-14T03:33:19Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 2afbaf4b-5424-414c-b89a-a3416925b9b0
    type: Scheduled
  phase: Running
  running: 2
Key fields
- minMember: the minimum number of Pods or tasks that must run under this PodGroup. If the cluster resources cannot satisfy minMember tasks, the scheduler will not schedule any task in the PodGroup.
- queue: the queue that the PodGroup belongs to. The queue must already exist and be in the open state.
- priorityClassName: the priority of the PodGroup, used by the scheduler to order all PodGroups in the same queue. system-node-critical and system-cluster-critical are two reserved values that indicate the highest priority. If not specified, the default or zero priority is used.
- minResources: the minimum resources required to run the PodGroup. If the allocatable resources of the cluster cannot satisfy minResources, the scheduler will not schedule any task in the PodGroup.
- phase: the current state of the PodGroup.
- conditions: the detailed status records of the PodGroup, covering the key events in its lifecycle.
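The queue and priorityClassName fields do not appear in the auto-created object above, but a PodGroup that sets them explicitly might look like the following sketch. The queue name default-queue and the PriorityClass name high-priority are assumptions for illustration, not names defined by this topic:

```yaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  name: gang-example
spec:
  minMember: 2
  queue: default-queue               # assumed queue; must exist and be open
  priorityClassName: high-priority   # assumed PriorityClass name
  minResources:
    nvidia.com/gpu: "2"
```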
Check the running status
Because the cluster has enough resources for all Pods of the job, the following command shows that the Pods are already running.
[root@pm-b86b yaml]# kubectl get po | grep gang
gang-example-worker-0 1/1 Running 0 31s
gang-example-worker-1 1/1 Running 0 31s
If the cluster does not have enough resources to run all Pods, none of the Pods will be scheduled, and the scheduling status can be inspected through the PodGroup.
[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":10,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:47:09Z"
  generation: 4
  name: gang-example
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gang-example
    uid: 8caecc94-7220-4bbc-bde2-6c94fe478a35
  resourceVersion: "40583543"
  uid: 69034c3f-3c51-4159-b8db-8a965a3838f7
spec:
  minMember: 10
  minResources:
    nvidia.com/gpu: "10"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:47:40Z"
    message: '10/0 tasks in gang unschedulable: pod group is not ready, 10 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 3359ff1a-d558-4148-949f-f3f53f501a4c
    type: Unschedulable
  phase: Pending