kube-scheduler (Overview)

Based on Kubernetes 1.25

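kube-scheduler is one of the Kubernetes control plane components. It is responsible for scheduling Pods across the whole cluster, that is, choosing a Node for each Pod.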

kube-scheduler scheduling model

The kube-scheduler scheduling process can be divided into two cycles and three phases.

The two cycles are the scheduling cycle and the binding cycle.
The three phases are filtering (Filter), scoring (Score), and binding (Bind).

In the scheduling cycle kube-scheduler works serially, scheduling one Pod at a time.
In the binding cycle it works in parallel.
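The serial/parallel split can be sketched as follows. This is a minimal illustration, not the scheduler's code: `schedulePod`, `bind`, and `run` are hypothetical names, and the node choice is a placeholder for the real Filter and Score phases.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// schedulePod stands in for the scheduling cycle (Filter + Score).
// Placeholder logic: it picks a node from the Pod name length modulo the node count.
func schedulePod(pod string, nodes []string) string {
	return nodes[len(pod)%len(nodes)]
}

// bind stands in for the binding cycle (an API call in the real scheduler).
func bind(pod, node string, mu *sync.Mutex, bound map[string]string) {
	mu.Lock()
	defer mu.Unlock()
	bound[pod] = node
}

// run picks nodes one Pod at a time and binds them concurrently.
func run(pods, nodes []string) map[string]string {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		bound = make(map[string]string)
	)
	for _, pod := range pods { // scheduling cycle: strictly serial
		node := schedulePod(pod, nodes)
		wg.Add(1)
		go func(pod, node string) { // binding cycle: one goroutine per Pod
			defer wg.Done()
			bind(pod, node, &mu, bound)
		}(pod, node)
	}
	wg.Wait()
	return bound
}

func main() {
	bound := run([]string{"a", "bb", "ccc"}, []string{"node-1", "node-2"})
	names := make([]string, 0, len(bound))
	for name := range bound {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, "->", bound[name])
	}
}
```

The loop never starts choosing a node for the next Pod until the current choice is made, while slow bindings do not hold up the loop.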

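kube-scheduler can aim for one of two kinds of optimum when scheduling a Pod:

Global optimum: the scheduling policy traverses all nodes and finds the best node in the whole cluster.
Local optimum: the scheduling policy traverses only part of the nodes and finds the best node among them.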

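kube-scheduler chooses between the two adaptively. By default, when the cluster has 100 nodes or fewer it evaluates every node and finds the global optimum; in larger clusters it stops searching once it has found enough feasible nodes and settles for a local optimum.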
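The adaptive rule can be sketched like this. The function is modeled on the upstream helper of the same name as I understand it for this release; the constants (a floor of 100 nodes, a 5% minimum, and a percentage that starts at 50 and drops by 1 per 125 nodes) are assumptions to verify against the source tree you are reading.

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind           = 100 // assumed default: never stop before 100 feasible nodes
	minFeasibleNodesPercentageToFind = 5   // assumed default: lower bound of the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the Filter phase
// should collect before it stops. percentageOfNodesToScore <= 0 means "adaptive".
func numFeasibleNodesToFind(percentageOfNodesToScore, numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
		return numAllNodes // small cluster or explicit 100%: global optimum
	}
	percentage := percentageOfNodesToScore
	if percentage <= 0 {
		percentage = 50 - numAllNodes/125
		if percentage < minFeasibleNodesPercentageToFind {
			percentage = minFeasibleNodesPercentageToFind
		}
	}
	numNodes := numAllNodes * percentage / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	for _, n := range []int32{50, 100, 1000, 5000} {
		fmt.Printf("nodes=%d -> search until %d feasible nodes\n", n, numFeasibleNodesToFind(0, n))
	}
}
```

Under these assumptions a 100-node cluster still searches all 100 nodes, because the computed 50 nodes is raised to the floor of 100.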

kube-scheduler internal architecture

It mainly consists of the Informer, Scheduling Queue, Cache, and Scheduling Framework components.

The Scheduling Queue holds Pods that are waiting to be scheduled, so that the scheduler can efficiently fetch the next Pod to schedule.

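The default implementation in kube-scheduler is PriorityQueue, a priority queue.
Internally it consists of two sub-queues and one additional map: activeQ, backoffQ, and unschedulablePods.
activeQ: a heap-based priority queue. By default it orders Pods by priority, and Pods of equal priority are served FIFO.
- Pods in activeQ already meet the requirements for scheduling and can be scheduled immediately.
- The main scheduling routine, scheduleOne, keeps popping Pods from activeQ to schedule them.
- If there is no Pod to schedule, scheduleOne blocks until one arrives.
backoffQ: a heap-based priority queue. Unlike activeQ, it is ordered by the time at which each Pod's backoff period expires.
- Its Pods meet the requirements for scheduling but are still in the backoff period that follows a failed attempt. Once the backoff time is reached they are moved to activeQ.
- Backoff addresses the "indefinite blocking" or "starvation" problem of priority scheduling (high-priority Pods are always scheduled first while low-priority Pods wait a long time without being scheduled).
- It uses exponential backoff: after each failure the wait grows 1s, 2s, 4s, ... up to a configured maximum.
unschedulablePods: holds Pods that failed scheduling. It is backed by a map. When a Pod fails scheduling, it is moved to unschedulablePods by default.
- When the cluster state changes, Pods in unschedulablePods may become schedulable, and they are selectively moved to activeQ or backoffQ.
- In addition, every 30s, Pods that have waited in unschedulablePods for more than 5min are automatically moved to activeQ or backoffQ.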

The Cache mainly speeds up lookups of Pod and Node information during scheduling.

The Informer mechanism is already in use, and an Informer keeps a local copy of the data by default. Why is a separate Cache still needed?

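Although the Informer speeds up reads of resource objects, the scheduler still has to compute scheduling-related data in real time while scheduling (for example, aggregated per-node state).
To keep scheduling accurate, a snapshot mechanism is also introduced, so that each scheduling cycle works on a consistent view of the cluster.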
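A minimal sketch of the idea, assuming a cache that only tracks requested CPU per node. The `cache` and `nodeInfo` types and their methods are hypothetical simplifications; the real scheduler cache tracks far more state.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeInfo holds data derived for scheduling, which a plain Informer store
// does not precompute: here, the CPU already requested on the node.
type nodeInfo struct {
	requestedMilliCPU int64
	pods              []string
}

type cache struct {
	mu    sync.RWMutex
	nodes map[string]*nodeInfo
}

func newCache(nodeNames ...string) *cache {
	c := &cache{nodes: make(map[string]*nodeInfo)}
	for _, name := range nodeNames {
		c.nodes[name] = &nodeInfo{}
	}
	return c
}

// addPod updates the aggregated state; it reports false for an unknown node.
func (c *cache) addPod(node, pod string, milliCPU int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	info, ok := c.nodes[node]
	if !ok {
		return false
	}
	info.requestedMilliCPU += milliCPU
	info.pods = append(info.pods, pod)
	return true
}

// snapshot returns a deep copy, so later cache updates cannot change what
// the current scheduling cycle sees.
func (c *cache) snapshot() map[string]nodeInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]nodeInfo, len(c.nodes))
	for name, info := range c.nodes {
		out[name] = nodeInfo{
			requestedMilliCPU: info.requestedMilliCPU,
			pods:              append([]string(nil), info.pods...),
		}
	}
	return out
}

func main() {
	c := newCache("node-1")
	c.addPod("node-1", "web", 500)
	snap := c.snapshot()
	c.addPod("node-1", "db", 250) // arrives while a cycle is using snap
	fmt.Println("snapshot:", snap["node-1"].requestedMilliCPU)
	fmt.Println("live:    ", c.snapshot()["node-1"].requestedMilliCPU)
}
```

Copying the whole map on every cycle is the simplest way to show the isolation; a production cache would avoid full copies.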

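The Scheduling Framework is what ties the scheduling algorithms together. It defines the scheduling process as a series of extension points, and plugins that implement concrete scheduling algorithms can be registered at each extension point.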

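The Scheduling Framework is driven by a dedicated goroutine that never exits.
It supports the Assume mechanism: once a node has been selected for a Pod, the Cache state is updated directly, without waiting for binding to finish.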
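The extension-point idea and Assume can be sketched with two simplified plugin interfaces. Everything below (`filterPlugin`, `scorePlugin`, `framework`, `scheduleAndAssume`, and the two example plugins) is a hypothetical stand-in; the real framework interfaces take more parameters and return status objects.

```go
package main

import "fmt"

type node struct {
	name         string
	freeMilliCPU int64
}

type pod struct {
	name     string
	milliCPU int64
}

// Simplified extension points: Filter rejects nodes, Score ranks the rest.
type filterPlugin interface{ Filter(p pod, n node) bool }
type scorePlugin interface{ Score(p pod, n node) int64 }

// fitsCPU is an example Filter plugin: the node must have enough free CPU.
type fitsCPU struct{}

func (fitsCPU) Filter(p pod, n node) bool { return n.freeMilliCPU >= p.milliCPU }

// mostFree is an example Score plugin: prefer the node with most CPU left over.
type mostFree struct{}

func (mostFree) Score(p pod, n node) int64 { return n.freeMilliCPU - p.milliCPU }

type framework struct {
	filters []filterPlugin
	scorers []scorePlugin
}

func (f framework) feasible(p pod, n node) bool {
	for _, pl := range f.filters {
		if !pl.Filter(p, n) {
			return false
		}
	}
	return true
}

// scheduleAndAssume runs Filter and Score, then "assumes" the Pod onto the
// chosen node by updating the in-memory node state immediately.
func (f framework) scheduleAndAssume(p pod, nodes []node) (string, bool) {
	best, bestScore := -1, int64(0)
	for i, n := range nodes {
		if !f.feasible(p, n) {
			continue
		}
		var score int64
		for _, pl := range f.scorers {
			score += pl.Score(p, n)
		}
		if best == -1 || score > bestScore {
			best, bestScore = i, score
		}
	}
	if best == -1 {
		return "", false
	}
	nodes[best].freeMilliCPU -= p.milliCPU // Assume: visible to the next cycle
	return nodes[best].name, true
}

func main() {
	fw := framework{filters: []filterPlugin{fitsCPU{}}, scorers: []scorePlugin{mostFree{}}}
	nodes := []node{{"node-1", 1000}, {"node-2", 2000}}
	for _, p := range []pod{{"a", 1500}, {"b", 800}, {"c", 600}} {
		name, ok := fw.scheduleAndAssume(p, nodes)
		fmt.Println(p.name, name, ok)
	}
}
```

Because the assumed usage is applied before binding completes, the next scheduling cycle already sees the reduced capacity.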

kube-scheduler event-driven mechanism

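Kubernetes components interact through events. The scheduler's events come from Informers, and they are registered in two ways: built-in resource events that are watched by default, and resource events that plugins register themselves.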

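Built-in default resource events: events on the core resource objects the scheduler watches by default, namely Pod and Node events. The scheduling framework registers these by default.
Plugin-defined resource events: events tied to a plugin's scheduling algorithm, registered through a method the plugin declares. For example, VolumeBinding needs to watch changes to PV/PVC resources.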

To support events on extension resource objects, kube-scheduler uses not only the standard SharedInformer but also a DynamicSharedInformer that supports dynamic types.

By default, kube-scheduler automatically registers event handlers for Pods and Nodes:

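For Pod events:
- The Pod is already scheduled (spec.nodeName is not empty): first update the Cache so that the cached state stays consistent, then look among the Pods that failed scheduling for those that have affinity with this Pod and trigger scheduling for them again.
- The Pod is not yet scheduled: update the Scheduling Queue so that the Pod can be scheduled. If an unscheduled Pod is deleted while Assumed Pods (Pods that have passed the scheduling cycle but have not finished the binding cycle) are waiting for it to be scheduled, the scheduler releases the resources held by those Assumed Pods and retries the Pods that have not been scheduled successfully.
For Node events: update the Cache to keep it consistent, and try to schedule Pods that may have become schedulable because of the Node change.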
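The Pod-event dispatch above comes down to one check on spec.nodeName. A minimal sketch, with hypothetical `scheduler` fields standing in for the real Cache and Scheduling Queue, and a counter standing in for "re-examine unschedulable Pods":

```go
package main

import "fmt"

type pod struct {
	name     string
	nodeName string // mirrors spec.nodeName; empty means not yet scheduled
}

type scheduler struct {
	cache        map[string]string // Pod name -> node name (assigned Pods)
	queue        []string          // names of Pods waiting to be scheduled
	moveRequests int               // times unschedulable Pods were asked to retry
}

func newScheduler() *scheduler {
	return &scheduler{cache: make(map[string]string)}
}

// onPodAdd routes the event by whether the Pod is already assigned.
func (s *scheduler) onPodAdd(p pod) {
	if p.nodeName != "" {
		s.cache[p.name] = p.nodeName // keep the Cache consistent
		s.moveRequests++             // related unschedulable Pods may now fit
		return
	}
	s.queue = append(s.queue, p.name) // unassigned: goes to the Scheduling Queue
}

// onPodDelete mirrors onPodAdd for deletions.
func (s *scheduler) onPodDelete(p pod) {
	if p.nodeName != "" {
		delete(s.cache, p.name)
		s.moveRequests++ // freed resources may unblock other Pods
		return
	}
	for i, name := range s.queue {
		if name == p.name {
			s.queue = append(s.queue[:i], s.queue[i+1:]...)
			return
		}
	}
}

func main() {
	s := newScheduler()
	s.onPodAdd(pod{name: "web", nodeName: "node-1"})
	s.onPodAdd(pod{name: "db"})
	fmt.Println(s.cache, s.queue, s.moveRequests)
	s.onPodDelete(pod{name: "db"})
	fmt.Println(s.cache, s.queue, s.moveRequests)
}
```

In the real scheduler the same split is made when the Informer event handlers are registered, so assigned and unassigned Pods never share a handler.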

The Scheduling Framework provides an open interface through which a plugin declares the resource object types and events it cares about.

Ref：https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/scheduler/framework/interface.go#L482

// EnqueueExtensions is an optional interface that plugins can implement to efficiently
// move unschedulable Pods in internal scheduling queues.
// In the scheduler, Pods can be unschedulable by PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins,
// and Pods rejected by these plugins are requeued based on this extension point.
// Failures from other extension points are regarded as temporal errors (e.g., network failure),
// and the scheduler requeue Pods without this extension point - always requeue Pods to activeQ after backoff.
// This is because such temporal errors cannot be resolved by specific cluster events,
// and we have no choose but keep retrying scheduling until the failure is resolved.
//
// Plugins that make pod unschedulable (PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins) should implement this interface,
// otherwise the default implementation will be used, which is less efficient in requeueing Pods rejected by the plugin.
// And, if plugins other than above extension points support this interface, they are just ignored.
type EnqueueExtensions interface {
	Plugin
	// EventsToRegister returns a series of possible events that may cause a Pod
	// failed by this plugin schedulable. Each event has a callback function that
	// filters out events to reduce useless retry of Pod's scheduling.
	// The events will be registered when instantiating the internal scheduling queue,
	// and leveraged to build event handlers dynamically.
	// When it returns an error, the scheduler fails to start.
	// Note: the returned list needs to be determined at a startup,
	// and the scheduler only evaluates it once during start up.
	// Do not change the result during runtime, for example, based on the cluster's state etc.
	//
	// Appropriate implementation of this function will make Pod's re-scheduling accurate and performant.
	EventsToRegister(context.Context) ([]ClusterEventWithHint, error)
}
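
How a plugin might use this extension point can be sketched without importing the scheduler packages. The types below (`clusterEvent`, `clusterEventWithHint`, `enqueueExtensions`) are simplified local stand-ins that only mirror the shape of the interface above, and the `volumeAware` plugin and its hint logic are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterEvent is a local stand-in: which resource and which action.
type clusterEvent struct {
	resource string // e.g. "PersistentVolumeClaim"
	action   string // e.g. "Add"
}

// clusterEventWithHint pairs an event with an optional filter callback.
type clusterEventWithHint struct {
	event clusterEvent
	// hint reports whether a concrete object change is worth a retry for the Pod.
	hint func(podName, objectName string) bool
}

// enqueueExtensions is a simplified version of the interface quoted above.
type enqueueExtensions interface {
	Name() string
	EventsToRegister() ([]clusterEventWithHint, error)
}

// volumeAware is a hypothetical plugin that rejects Pods for storage reasons.
type volumeAware struct{}

func (volumeAware) Name() string { return "VolumeAware" }

func (volumeAware) EventsToRegister() ([]clusterEventWithHint, error) {
	return []clusterEventWithHint{
		{
			event: clusterEvent{resource: "PersistentVolumeClaim", action: "Add"},
			// Invented convention: only the claim named "<pod>-data" matters.
			hint: func(podName, objectName string) bool { return objectName == podName+"-data" },
		},
		{event: clusterEvent{resource: "PersistentVolume", action: "Add"}},
	}, nil
}

// buildEventMap collects, once at startup, which plugins care about which event.
func buildEventMap(plugins []enqueueExtensions) (map[clusterEvent][]string, error) {
	m := make(map[clusterEvent][]string)
	for _, p := range plugins {
		events, err := p.EventsToRegister()
		if err != nil {
			return nil, fmt.Errorf("plugin %s: %w", p.Name(), err)
		}
		for _, e := range events {
			m[e.event] = append(m[e.event], p.Name())
		}
	}
	return m, nil
}

func main() {
	m, err := buildEventMap([]enqueueExtensions{volumeAware{}})
	if err != nil {
		panic(err)
	}
	lines := make([]string, 0, len(m))
	for ev, names := range m {
		lines = append(lines, fmt.Sprintf("%s/%s -> %v", ev.resource, ev.action, names))
	}
	sort.Strings(lines)
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

The map is built once, which matches the comment in the interface that the returned list is only evaluated during startup.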