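kube-scheduler Scheduling Framework Source Code Study, Part 0
Scheduler workflow
This article and the rest of the series are based on Kubernetes 1.21, which corresponds to the release-1.21 branch of the repository.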
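- kube-scheduler watches the apiserver (it does not talk to etcd directly) for pods whose nodeName in the podSpec is empty
- The pod enters the corresponding scheduler queue. After going through the scheduling workflow it is assigned to a suitable node, that is, nodeName in the podSpec is written through the apiserver. Scheduling can also fail, in which case the pod goes back to the corresponding queue
- The kubelet sees the pods that belong to its own node and starts the container-related work that follows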
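The three steps above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not how kube-scheduler is implemented: the names Pod, pick_node and schedule_once are hypothetical, and the "most free CPU that still fits" rule stands in for the real filtering and scoring.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    cpu_request: int                  # hypothetical single resource, in millicores
    node_name: Optional[str] = None   # stands in for nodeName in the podSpec


def pick_node(pod: Pod, free_cpu: Dict[str, int]) -> Optional[str]:
    """Return the node with the most free CPU that still fits the pod, or None."""
    candidates = [n for n, free in free_cpu.items() if free >= pod.cpu_request]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (free_cpu[n], n))


def schedule_once(pods: List[Pod], free_cpu: Dict[str, int]) -> List[Pod]:
    """One pass over the pending pods; returns the pods that must be retried."""
    # Only pods without a node assignment are of interest to the scheduler.
    queue = deque(p for p in pods if p.node_name is None)
    retry = []
    while queue:
        pod = queue.popleft()
        node = pick_node(pod, free_cpu)
        if node is None:
            # Scheduling failed: the pod goes back to a queue for a later attempt.
            retry.append(pod)
            continue
        # Scheduling succeeded: "bind" by recording the node name on the pod.
        pod.node_name = node
        free_cpu[node] -= pod.cpu_request
    return retry


pods = [Pod("web", 500), Pod("db", 2000), Pod("batch", 4000),
        Pod("bound", 100, "node-a")]
free_cpu = {"node-a": 1000, "node-b": 2500}
retry = schedule_once(pods, free_cpu)
for pod in pods:
    print(pod.name, "->", pod.node_name)
```

In this run the pod named batch fits on no node and ends up in the retry list, while the pod that already had a node name is never touched.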
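Scheduling framework workflow
Once a pod is inside the scheduler, most of the work is done by the scheduling framework, which runs the pod through a pipeline-like series of steps. The blog post listed under References covers the details, so they are not repeated here.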
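Running the scheduler locally
The entry point is cmd/kube-scheduler/scheduler.go
The first time you start the scheduler locally in a development environment, you need to configure a kubeconfig so that it can connect to the apiserver.
After adding --kubeconfig=$path-to-kubeconfig to the startup arguments, the scheduler starts locally.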
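Exporting the default configuration
To inspect the scheduler's default configuration (the KubeSchedulerConfiguration object in the kubescheduler.config.k8s.io/v1beta1 group), add the startup argument --write-config-to $path-to-default-config, which exports the configuration file used by a default start.
The default configuration exported in this run is: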
apiVersion: kubescheduler.config.k8s.io/v1beta1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  # path to the kubeconfig
  kubeconfig: /Users/xxxxx/.kube/config
  qps: 50
enableContentionProfiling: true
enableProfiling: true
healthzBindAddress: 0.0.0.0:10251
kind: KubeSchedulerConfiguration
# leader election settings
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  resourceLock: leases
  resourceName: kube-scheduler
  resourceNamespace: kube-system
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
# several profiles can be configured at the same time
# https://kubernetes.io/zh/docs/reference/scheduling/config/#multiple-profiles
profiles:
# arguments passed to the scheduling framework plugins
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: NodeResourcesFitArgs
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: NodeResourcesLeastAllocatedArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesLeastAllocated
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  # plugins enabled in the scheduling framework
  plugins:
    bind:
      enabled:
      - name: DefaultBinder
        weight: 0
    filter:
      enabled:
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 0
      - name: NodeAffinity
        weight: 0
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 0
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 0
      - name: InterPodAffinity
        weight: 0
    permit: {}
    postBind: {}
    postFilter:
      enabled:
      - name: DefaultPreemption
        weight: 0
    preBind:
      enabled:
      - name: VolumeBinding
        weight: 0
    preFilter:
      enabled:
      - name: NodeResourcesFit
        weight: 0
      - name: NodePorts
        weight: 0
      - name: PodTopologySpread
        weight: 0
      - name: InterPodAffinity
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: NodeAffinity
        weight: 0
    preScore:
      enabled:
      - name: InterPodAffinity
        weight: 0
      - name: PodTopologySpread
        weight: 0
      - name: TaintToleration
        weight: 0
      - name: NodeAffinity
        weight: 0
    queueSort:
      enabled:
      - name: PrioritySort
        weight: 0
    reserve:
      enabled:
      - name: VolumeBinding
        weight: 0
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: InterPodAffinity
        weight: 1
      - name: NodeResourcesLeastAllocated
        weight: 1
      - name: NodeAffinity
        weight: 1
      - name: NodePreferAvoidPods
        weight: 10000
      - name: PodTopologySpread
        weight: 2
      - name: TaintToleration
        weight: 1
  # scheduler name
  schedulerName: default-scheduler
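Only the score plugins in this file carry a non-zero weight. As a rough sketch of what those weights mean, the framework multiplies each score plugin's per-node score by its weight and adds the results up. The weights below are copied from the exported profile; the node names and the per-plugin scores are made-up values for illustration, and weighted_total is a hypothetical helper, not a scheduler function.

```python
# Score-plugin weights copied from the exported default profile.
SCORE_WEIGHTS = {
    "NodeResourcesBalancedAllocation": 1,
    "ImageLocality": 1,
    "InterPodAffinity": 1,
    "NodeResourcesLeastAllocated": 1,
    "NodeAffinity": 1,
    "NodePreferAvoidPods": 10000,
    "PodTopologySpread": 2,
    "TaintToleration": 1,
}


def weighted_total(plugin_scores, weights=SCORE_WEIGHTS):
    """Sum of (plugin score * plugin weight) for one node."""
    return sum(weights[name] * score for name, score in plugin_scores.items())


# Hypothetical per-plugin scores (0 to 100) for two nodes.
node_a = {"NodeResourcesLeastAllocated": 80, "PodTopologySpread": 50,
          "NodePreferAvoidPods": 0}
node_b = {"NodeResourcesLeastAllocated": 20, "PodTopologySpread": 10,
          "NodePreferAvoidPods": 100}

total_a = weighted_total(node_a)
total_b = weighted_total(node_b)
print("node-a:", total_a)
print("node-b:", total_b)
```

With these made-up inputs node-b wins by a wide margin even though it scores lower on the other two plugins, which shows why a weight of 10000 makes NodePreferAvoidPods dominate the ranking.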
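With this configuration file as a template, you can modify it as needed and then start the scheduler with the startup argument --config=$path-to-config. Note that once --config is specified, the kubeconfig setting is also taken from the clientConnection.kubeconfig field of that file, and the --kubeconfig command-line argument no longer takes effect.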
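The precedence between the two settings can be modeled in a few lines. This is a sketch of the rule as described here, not the scheduler's own option handling; effective_kubeconfig is a hypothetical helper, the dict stands in for the parsed configuration file, and the paths are placeholders.

```python
from typing import Optional


def effective_kubeconfig(flag_kubeconfig: Optional[str],
                         config_file: Optional[dict]) -> Optional[str]:
    """Return the kubeconfig path that ends up being used."""
    if config_file is not None:
        # --config was given: only clientConnection.kubeconfig counts,
        # and the value of the --kubeconfig flag is ignored.
        return config_file.get("clientConnection", {}).get("kubeconfig")
    # No --config: the --kubeconfig flag is used as-is.
    return flag_kubeconfig


parsed_config = {"clientConnection": {"kubeconfig": "/path/from/config-file"}}

print(effective_kubeconfig("/path/from/flag", None))
print(effective_kubeconfig("/path/from/flag", parsed_config))
```

The practical consequence is that a kubeconfig path left over in an exported template silently overrides whatever is passed on the command line, so the field should be checked whenever the template is copied to another machine.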
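Before customizing the scheduling framework, it helps to understand how the scheduler's built-in default plugins behave once the configuration file has been modified:
- If no extension is configured for an extension point, the scheduling framework uses the extensions of the default plugins
- If an extension is configured and enabled for an extension point, the scheduling framework first calls the extensions of the default plugins and then the configured extension
- The extensions of the default plugins are always called first, and the remaining extensions of the extension point are then called one by one in the order in which they appear in the enabled list of KubeSchedulerConfiguration
- You can disable the extension of a default plugin first and then enable it again at some position in the enabled list, which changes the order in which that default plugin's extension is called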
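These ordering rules can be modeled with a short function. It is a sketch of the behavior described in the list, not a copy of the scheduler's code. MyPlugin is a hypothetical custom plugin, the defaults are the preScore plugins from the exported configuration, and the handling of "*" as "disable every default plugin" is an addition taken from the Kubernetes configuration documentation rather than from this article.

```python
def effective_order(defaults, enabled=(), disabled=()):
    """Plugins for one extension point, in the order they would be called."""
    disabled_names = set(disabled)
    if "*" in disabled_names:
        # Assumption from the Kubernetes docs: "*" disables all defaults.
        kept = []
    else:
        # Default plugins keep their relative order unless disabled.
        kept = [name for name in defaults if name not in disabled_names]
    # Configured plugins follow the remaining defaults, in enabled-list order.
    return kept + list(enabled)


PRESCORE_DEFAULTS = ["InterPodAffinity", "PodTopologySpread",
                     "TaintToleration", "NodeAffinity"]

# Nothing configured: the defaults are used unchanged.
print(effective_order(PRESCORE_DEFAULTS))

# A custom plugin is enabled: it runs after all defaults.
print(effective_order(PRESCORE_DEFAULTS, enabled=["MyPlugin"]))

# A default is disabled and re-enabled after the custom plugin,
# which moves it to the end of the call order.
print(effective_order(PRESCORE_DEFAULTS,
                      enabled=["MyPlugin", "TaintToleration"],
                      disabled=["TaintToleration"]))
```

The last call is the reordering trick from the final bullet: TaintToleration is removed from its default position and called after MyPlugin instead.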
References
自定义 Kubernetes 调度器 (Customizing the Kubernetes Scheduler), 阳明的博客 (qikqiak.com)
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.