Scheduler workflow

This post and the rest of the series refer to the Kubernetes 1.21 source code; the corresponding repository branch is release-1.21.

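  1. kube-scheduler watches the apiserver (which is backed by etcd) for pods whose podSpec has an empty nodeName.
  2. Each such pod enters the matching scheduler queue. After going through the scheduling workflow it is assigned to a suitable node, which means nodeName in the podSpec is written through the apiserver. Scheduling can also fail, in which case the pod goes back into the matching queue.
  3. The kubelet on each node watches for pods assigned to its own node and then starts the container-related operations.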
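
To make the three steps concrete, here is a minimal sketch of a pod as the scheduler would see it. The pod name, container name, image, and node name are hypothetical and only serve as an illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod                      # hypothetical name
spec:
  schedulerName: default-scheduler    # the profile expected to handle this pod
  # nodeName is absent here, so the pod is still unscheduled (step 1).
  # Once scheduling succeeds, the spec carries a line such as
  #   nodeName: node-1                # hypothetical node
  # and the kubelet on that node takes over (steps 2 and 3).
  containers:
    - name: app                       # hypothetical container
      image: nginx                    # hypothetical image
```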

Scheduling framework workflow

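Once a pod is inside the scheduler, the scheduling framework processes it through a pipeline-like series of steps. See the blog post listed under References for details; they are not repeated here.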

Starting the scheduler locally

The entry point is cmd/kube-scheduler/scheduler.go.

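Before the first local start in a development environment, a kubeconfig must be configured so that the scheduler can connect to the apiserver.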

After adding --kubeconfig=$path-to-kubeconfig to the startup arguments, the scheduler starts locally.

Exporting the default configuration

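To inspect the scheduler's default configuration (the KubeSchedulerConfiguration object in the kubescheduler.config.k8s.io/v1beta1 group), add the startup argument --write-config-to $path-to-default-config. The scheduler then writes out the configuration file it uses by default.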

The default configuration file exported in this run is shown below:

apiVersion: kubescheduler.config.k8s.io/v1beta1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  # path to the kubeconfig
  kubeconfig: /Users/xxxxx/.kube/config
  qps: 50
enableContentionProfiling: true
enableProfiling: true
healthzBindAddress: 0.0.0.0:10251
kind: KubeSchedulerConfiguration
# leader election settings
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  resourceLock: leases
  resourceName: kube-scheduler
  resourceNamespace: kube-system
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
# multiple profiles can be configured at the same time
# https://kubernetes.io/zh/docs/reference/scheduling/config/#multiple-profiles
profiles:
  # arguments passed to the scheduling framework plugins
  - pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          kind: DefaultPreemptionArgs
          minCandidateNodesAbsolute: 100
          minCandidateNodesPercentage: 10
        name: DefaultPreemption
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          hardPodAffinityWeight: 1
          kind: InterPodAffinityArgs
        name: InterPodAffinity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          kind: NodeAffinityArgs
        name: NodeAffinity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          kind: NodeResourcesFitArgs
        name: NodeResourcesFit
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          kind: NodeResourcesLeastAllocatedArgs
          resources:
            - name: cpu
              weight: 1
            - name: memory
              weight: 1
        name: NodeResourcesLeastAllocated
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          defaultingType: System
          kind: PodTopologySpreadArgs
        name: PodTopologySpread
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1beta1
          bindTimeoutSeconds: 600
          kind: VolumeBindingArgs
        name: VolumeBinding
    # plugins enabled in the scheduling framework
    plugins:
      bind:
        enabled:
          - name: DefaultBinder
            weight: 0
      filter:
        enabled:
          - name: NodeUnschedulable
            weight: 0
          - name: NodeName
            weight: 0
          - name: TaintToleration
            weight: 0
          - name: NodeAffinity
            weight: 0
          - name: NodePorts
            weight: 0
          - name: NodeResourcesFit
            weight: 0
          - name: VolumeRestrictions
            weight: 0
          - name: EBSLimits
            weight: 0
          - name: GCEPDLimits
            weight: 0
          - name: NodeVolumeLimits
            weight: 0
          - name: AzureDiskLimits
            weight: 0
          - name: VolumeBinding
            weight: 0
          - name: VolumeZone
            weight: 0
          - name: PodTopologySpread
            weight: 0
          - name: InterPodAffinity
            weight: 0
      permit: {}
      postBind: {}
      postFilter:
        enabled:
          - name: DefaultPreemption
            weight: 0
      preBind:
        enabled:
          - name: VolumeBinding
            weight: 0
      preFilter:
        enabled:
          - name: NodeResourcesFit
            weight: 0
          - name: NodePorts
            weight: 0
          - name: PodTopologySpread
            weight: 0
          - name: InterPodAffinity
            weight: 0
          - name: VolumeBinding
            weight: 0
          - name: NodeAffinity
            weight: 0
      preScore:
        enabled:
          - name: InterPodAffinity
            weight: 0
          - name: PodTopologySpread
            weight: 0
          - name: TaintToleration
            weight: 0
          - name: NodeAffinity
            weight: 0
      queueSort:
        enabled:
          - name: PrioritySort
            weight: 0
      reserve:
        enabled:
          - name: VolumeBinding
            weight: 0
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 1
          - name: ImageLocality
            weight: 1
          - name: InterPodAffinity
            weight: 1
          - name: NodeResourcesLeastAllocated
            weight: 1
          - name: NodeAffinity
            weight: 1
          - name: NodePreferAvoidPods
            weight: 10000
          - name: PodTopologySpread
            weight: 2
          - name: TaintToleration
            weight: 1
    # scheduler name
    schedulerName: default-scheduler

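With this template in hand, you can modify it as needed and start the scheduler with the argument --config=$path-to-config. Note that in the 1.21 code, once --config is specified, the kubeconfig is taken from the clientConnection.kubeconfig field of that configuration file, and the --kubeconfig command-line argument no longer takes effect.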
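
As an illustration of a trimmed-down file for --config, a minimal sketch might look like the following. The kubeconfig path is a placeholder, and leaderElect: false is an assumption that suits a single local instance; fields left out are expected to fall back to the defaults shown in the exported file.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  # placeholder path; with --config this field replaces --kubeconfig
  kubeconfig: /path/to/kubeconfig
leaderElection:
  # assumption: one local instance, so leader election is switched off
  leaderElect: false
profiles:
  # a single profile that keeps all default plugins
  - schedulerName: default-scheduler
```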

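Before customizing the scheduling framework, it helps to understand how the scheduler's built-in default plugins behave once the configuration file is modified: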

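  • If no extension is configured for an extension point, the scheduling framework uses the extensions from the default plugins.
  • If extensions are configured and enabled for an extension point, the scheduling framework calls the default plugins' extensions first and then the configured ones.
  • The default plugins' extensions are always called first; the remaining extensions of that extension point are then called one by one, in the order they appear in the enabled list of the KubeSchedulerConfiguration.
  • You can disable a default plugin's extension and then enable it again at a chosen position in the enabled list, which changes the order in which that default extension is called.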
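
The rules above can be sketched with a profile like the one below. It is an illustration under the v1beta1 API used in this post, not a recommended setup, and the kubeconfig path is a placeholder. The filter section disables the default NodeResourcesFit and enables it again, which should move it behind the remaining default filter plugins. The score section replaces the default NodeResourcesLeastAllocated with the in-tree NodeResourcesMostAllocated plugin, which is not enabled by default.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /path/to/kubeconfig     # placeholder path
profiles:
  - schedulerName: default-scheduler
    plugins:
      filter:
        disabled:
          # remove the default plugin from its original position
          - name: NodeResourcesFit
        enabled:
          # re-enable it; entries here run after the remaining defaults
          - name: NodeResourcesFit
      score:
        disabled:
          - name: NodeResourcesLeastAllocated
        enabled:
          # swapped-in scoring strategy, appended after the default score plugins
          - name: NodeResourcesMostAllocated
            weight: 1
```

Exporting the effective configuration again with --write-config-to is one way to check that the resulting plugin order matches what you intended.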

References

自定义 Kubernetes 调度器 (Customizing the Kubernetes Scheduler), 阳明的博客 (qikqiak.com)
