kube-scheduler (Overview)
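Based on Kubernetes 1.25.
kube-scheduler is one of the Kubernetes control plane components. It is responsible for scheduling Pods across the whole cluster.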
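kube-scheduler scheduling model
The kube-scheduler scheduling process can be divided into two cycles and three phases.
- The two cycles are the scheduling cycle and the binding cycle.
- The three phases are filtering (Filter), scoring (Score), and binding (Bind).
- In the scheduling cycle, kube-scheduler works serially and schedules one Pod at a time.
- In the binding cycle, it works in parallel.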
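The split between a serial scheduling cycle and a parallel binding cycle can be sketched roughly as follows. This is an illustration only, not the scheduler's code: `schedule`, `runScheduler`, and the placement rule are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// schedule is a hypothetical stand-in for the Filter and Score phases.
// It picks a node with an arbitrary rule based on the pod name length.
func schedule(pod string, nodes []string) string {
	return nodes[len(pod)%len(nodes)]
}

// runScheduler handles pods one at a time (scheduling cycle) and hands
// each binding to its own goroutine (binding cycle).
func runScheduler(pods, nodes []string) map[string]string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		bindings = make(map[string]string)
	)
	for _, pod := range pods { // serial: one pod per iteration
		node := schedule(pod, nodes)
		wg.Add(1)
		go func(pod, node string) { // parallel: binding does not block the loop
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bindings[pod] = node // stand-in for the API call that binds the pod
		}(pod, node)
	}
	wg.Wait()
	return bindings
}

func main() {
	b := runScheduler([]string{"a", "bb", "ccc"}, []string{"node-0", "node-1"})
	fmt.Println(len(b), b["bb"])
}
```

Because the loop does not wait for a binding to finish, a slow bind call delays only that Pod and not the ones queued behind it.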
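When scheduling a Pod, kube-scheduler can aim for one of two kinds of optimum:
- Global optimum: the scheduling policy evaluates every node and finds the best node in the whole cluster.
- Local optimum: the scheduling policy evaluates only a subset of the nodes and finds the best node within that subset.
- kube-scheduler chooses between the two adaptively. By default, when the cluster has 100 nodes or fewer, it evaluates all nodes and looks for the global optimum.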
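The adaptive choice can be sketched as a small function. The version below is modeled on my understanding of the upstream rule and should be read as an approximation: the constants (a floor of 100 nodes, a 5% minimum, and a default that starts at 50% and shrinks as the cluster grows) are assumptions to check against the release you run, and `percentage` stands in for the `percentageOfNodesToScore` setting, with 0 meaning "use the adaptive default".

```go
package main

import "fmt"

const (
	// Assumed defaults; verify them against your Kubernetes version.
	minFeasibleNodesToFind           = 100
	minFeasibleNodesPercentageToFind = 5
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler
// looks for before it stops filtering and starts scoring.
func numFeasibleNodesToFind(numAllNodes, percentage int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentage >= 100 {
		return numAllNodes // small cluster: evaluate everything
	}
	adaptive := percentage
	if adaptive <= 0 {
		// Shrink the sampled share as the cluster grows.
		adaptive = 50 - numAllNodes/125
		if adaptive < minFeasibleNodesPercentageToFind {
			adaptive = minFeasibleNodesPercentageToFind
		}
	}
	n := numAllNodes * adaptive / 100
	if n < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return n
}

func main() {
	for _, nodes := range []int32{50, 100, 1000, 5000, 6000} {
		fmt.Println(nodes, numFeasibleNodesToFind(nodes, 0))
	}
}
```

Under these assumptions a 100-node cluster still has all 100 nodes evaluated, while a 1000-node cluster stops after finding 420 feasible nodes.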
kube-scheduler internal architecture
The scheduler mainly consists of the Informer, the Scheduling Queue, the Cache, and the Scheduling Framework.
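The Scheduling Queue holds the Pods that are waiting to be scheduled, so that the next Pod to schedule can be fetched efficiently.
- The default implementation in kube-scheduler is PriorityQueue, a priority queue.
- Internally, the priority queue is made up of two sub-queues and one additional map: activeQ, backoffQ, and unschedulablePods.
- activeQ: a heap-based priority queue. By default it orders Pods by priority, and Pods with equal priority are served FIFO.
- Pods in activeQ already meet the requirements for scheduling and can be scheduled immediately.
- The main scheduling routine, scheduleOne, keeps fetching Pods from activeQ and scheduling them.
- If there is no Pod to schedule, scheduleOne blocks until one arrives.
- backoffQ: also a heap-based priority queue. Unlike activeQ, it orders Pods by the time at which their backoff period expires.
- Pods in backoffQ meet the requirements for scheduling but are still inside the backoff period that follows a failed scheduling attempt. Once the backoff period is over, they are moved to activeQ.
- Backoff is meant to address the "indefinite blocking" or "starvation" problem in priority-based scheduling, where high-priority Pods keep being scheduled while low-priority Pods wait for a long time without ever being scheduled.
- The backoff is exponential: 1s, 2s, 4s, and so on after each failure, up to a maximum duration (10s by default).
- unschedulablePods: holds Pods that failed scheduling. The underlying structure is a map. When a Pod fails scheduling, it is moved to unschedulablePods by default.
- When the cluster state changes, Pods in unschedulablePods may become schedulable, so they are selectively moved to activeQ or backoffQ.
- In addition, every 30s, Pods that have waited in unschedulablePods for more than 5min are moved to activeQ or backoffQ.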
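Two pieces of the queue described above lend themselves to a short sketch: the ordering used by activeQ and the exponential backoff calculation. The code is a simplified illustration with hypothetical names (`podInfo`, `activeHeap`, `backoffDuration`); it leaves out backoffQ, unschedulablePods, and all locking.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// podInfo is a hypothetical, reduced record for a queued pod.
type podInfo struct {
	name      string
	priority  int
	timestamp time.Time // when the pod entered the queue
}

// activeHeap orders pods by priority (higher first), then by age (older first).
type activeHeap []*podInfo

func (h activeHeap) Len() int { return len(h) }
func (h activeHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].timestamp.Before(h[j].timestamp)
}
func (h activeHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *activeHeap) Push(x any)   { *h = append(*h, x.(*podInfo)) }
func (h *activeHeap) Pop() any {
	old := *h
	p := old[len(old)-1]
	*h = old[:len(old)-1]
	return p
}

// add and drain are small helpers around container/heap.
func (h *activeHeap) add(p *podInfo) { heap.Push(h, p) }
func (h *activeHeap) drain() []string {
	var names []string
	for h.Len() > 0 {
		names = append(names, heap.Pop(h).(*podInfo).name)
	}
	return names
}

// backoffDuration doubles the wait for every failed attempt, capped at max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		if d > max-d { // doubling would exceed the cap
			return max
		}
		d += d
	}
	return d
}

func main() {
	now := time.Now()
	q := &activeHeap{}
	q.add(&podInfo{name: "low", priority: 1, timestamp: now})
	q.add(&podInfo{name: "high-late", priority: 10, timestamp: now.Add(time.Second)})
	q.add(&podInfo{name: "high-early", priority: 10, timestamp: now})
	fmt.Println(q.drain())
	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

With a 1s initial value and a 10s cap, the waits come out as 1s, 2s, 4s, 8s, and then 10s for every later attempt.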
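The Cache's main purpose is to speed up lookups of Pod and node information during scheduling.
Why is a separate Cache needed when the Informer mechanism already keeps a local copy of the data?
- The Informer makes reading resource objects faster, but the scheduler still needs scheduling-specific data computed in real time while it schedules.
- To keep scheduling decisions accurate, a snapshot mechanism is also introduced: the scheduler works from a snapshot of the Cache taken at the start of each scheduling cycle, so the data does not change in the middle of a decision.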
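The difference between the Informer's copy and the scheduler's Cache can be sketched like this: the Cache stores data aggregated per node (here, only summed CPU requests), and a snapshot copies the nodes that changed since the previous snapshot. All types and method names are hypothetical, and the generation counter is one possible way to make the copy incremental. Node deletion is not handled.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeInfo holds per-node data the scheduler wants precomputed.
type nodeInfo struct {
	allocatableCPU int64 // millicores
	requestedCPU   int64 // sum of CPU requests of pods on the node
	generation     int64 // set on every change to this node
}

type cache struct {
	mu         sync.Mutex
	nodes      map[string]*nodeInfo
	generation int64
}

func newCache() *cache { return &cache{nodes: map[string]*nodeInfo{}} }

func (c *cache) addNode(name string, allocatableCPU int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.generation++
	c.nodes[name] = &nodeInfo{allocatableCPU: allocatableCPU, generation: c.generation}
}

// addPod aggregates a pod's request into its node, as an event handler would.
func (c *cache) addPod(node string, cpu int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node]
	if !ok {
		return
	}
	c.generation++
	n.requestedCPU += cpu
	n.generation = c.generation
}

// snapshot is the frozen view used during one scheduling cycle.
type snapshot struct {
	nodes      map[string]nodeInfo
	generation int64
}

func newSnapshot() *snapshot { return &snapshot{nodes: map[string]nodeInfo{}} }

// updateSnapshot copies only nodes changed since the last update and
// returns how many were copied.
func (c *cache) updateSnapshot(s *snapshot) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	copied := 0
	for name, n := range c.nodes {
		if n.generation > s.generation {
			s.nodes[name] = *n
			copied++
		}
	}
	s.generation = c.generation
	return copied
}

func main() {
	c := newCache()
	c.addNode("node-1", 4000)
	c.addNode("node-2", 4000)
	s := newSnapshot()
	fmt.Println(c.updateSnapshot(s)) // both nodes are new to the snapshot
	c.addPod("node-1", 500)
	fmt.Println(s.nodes["node-1"].requestedCPU) // snapshot is unaffected so far
	fmt.Println(c.updateSnapshot(s), s.nodes["node-1"].requestedCPU)
}
```

The snapshot stays fixed between updates, so a scheduling cycle that reads from it sees one consistent view even while event handlers keep modifying the Cache.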
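The Scheduling Framework is what ties the scheduling algorithms together. It defines the scheduling process as a series of extension points, and plugins that implement specific scheduling algorithms can be registered at each extension point.
The Scheduling Framework is driven by a dedicated goroutine that never exits.
It supports an Assume mechanism: once a node has been selected for a Pod, the Cache is updated immediately, without waiting for binding to complete.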
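The Assume mechanism can be sketched as optimistic bookkeeping: record the placement in the Cache first, then bind, and roll back if binding fails. The names below (`assume`, `confirm`, `forget`, `placePod`) are hypothetical, and the sketch binds synchronously and without locking to stay short, whereas the scheduler binds in a separate goroutine.

```go
package main

import (
	"errors"
	"fmt"
)

// errBind is a hypothetical error used to simulate a failed bind call.
var errBind = errors.New("bind failed")

type assumedPod struct {
	node string
	cpu  int64
}

type cache struct {
	requested map[string]int64      // node -> requested millicores
	assumed   map[string]assumedPod // pod -> optimistic placement
}

func newCache() *cache {
	return &cache{requested: map[string]int64{}, assumed: map[string]assumedPod{}}
}

// assume records the placement right away, so the next scheduling cycle
// already sees the resources as taken.
func (c *cache) assume(pod, node string, cpu int64) {
	c.requested[node] += cpu
	c.assumed[pod] = assumedPod{node: node, cpu: cpu}
}

// confirm drops the assumed marker after a successful bind; the
// resources stay counted.
func (c *cache) confirm(pod string) { delete(c.assumed, pod) }

// forget rolls the placement back after a failed bind.
func (c *cache) forget(pod string) {
	if a, ok := c.assumed[pod]; ok {
		c.requested[a.node] -= a.cpu
		delete(c.assumed, pod)
	}
}

// placePod runs assume, then bind, then confirm or forget.
// bind is a stand-in for the API call that binds the pod.
func placePod(c *cache, pod, node string, cpu int64, bind func() error) error {
	c.assume(pod, node, cpu)
	if err := bind(); err != nil {
		c.forget(pod)
		return err
	}
	c.confirm(pod)
	return nil
}

func main() {
	c := newCache()
	_ = placePod(c, "pod-a", "node-1", 500, func() error { return nil })
	err := placePod(c, "pod-b", "node-1", 300, func() error { return errBind })
	fmt.Println(c.requested["node-1"], err)
}
```

Updating the Cache before binding is what lets the serial scheduling cycle move on to the next Pod without double-booking the node.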
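kube-scheduler is event-driven
Kubernetes components interact through events. For the scheduler, these events come from Informers, and they are registered in two ways: built-in resource events that are watched by default, and custom resource events.
- Built-in default resource events: events for the core resource objects that the scheduler cares about by default, namely Pod and Node events. The scheduling framework registers these itself.
- Plugin-defined resource events: events that are relevant to a plugin's scheduling algorithm and are registered through a method the plugin declares. For example, VolumeBinding needs to watch changes to PV/PVC resources.
To support events on extended resource objects, kube-scheduler uses the dynamically typed DynamicSharedInformer in addition to the standard SharedInformer.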
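By default, kube-scheduler automatically registers event handlers for Pods and Nodes:
- For Pod events:
- The Pod is already scheduled (spec.nodeName is not empty): first update the Cache so that the cached state stays consistent, then look among the Pods that failed scheduling for Pods that have affinity with this Pod, and trigger scheduling for them again.
- The Pod is not yet scheduled: update the Scheduling Queue so that the Pod can be scheduled. If an unscheduled Pod is deleted and there are Assumed Pods (Pods that have finished the scheduling cycle but not the binding cycle) waiting for that Pod to be scheduled, the scheduler releases the resources held by those Assumed Pods and retries scheduling for the Pods that were not scheduled successfully.
- For Node events: update the Cache so that it stays consistent, and try to schedule the Pods that may have become schedulable because of the Node change.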
The Scheduling Framework provides an open interface through which a plugin declares the resource object types and events it cares about:
// EnqueueExtensions is an optional interface that plugins can implement to efficiently
// move unschedulable Pods in internal scheduling queues.
// In the scheduler, Pods can be unschedulable by PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins,
// and Pods rejected by these plugins are requeued based on this extension point.
// Failures from other extension points are regarded as temporal errors (e.g., network failure),
// and the scheduler requeue Pods without this extension point - always requeue Pods to activeQ after backoff.
// This is because such temporal errors cannot be resolved by specific cluster events,
// and we have no choose but keep retrying scheduling until the failure is resolved.
//
// Plugins that make pod unschedulable (PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins) should implement this interface,
// otherwise the default implementation will be used, which is less efficient in requeueing Pods rejected by the plugin.
// And, if plugins other than above extension points support this interface, they are just ignored.
type EnqueueExtensions interface {
	Plugin
	// EventsToRegister returns a series of possible events that may cause a Pod
	// failed by this plugin schedulable. Each event has a callback function that
	// filters out events to reduce useless retry of Pod's scheduling.
	// The events will be registered when instantiating the internal scheduling queue,
	// and leveraged to build event handlers dynamically.
	// When it returns an error, the scheduler fails to start.
	// Note: the returned list needs to be determined at a startup,
	// and the scheduler only evaluates it once during start up.
	// Do not change the result during runtime, for example, based on the cluster's state etc.
	//
	// Appropriate implementation of this function will make Pod's re-scheduling accurate and performant.
	EventsToRegister(context.Context) ([]ClusterEventWithHint, error)
}
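To show how such registrations might be used, here is a self-contained sketch with a deliberately reduced interface. Nothing in it is the real framework API: `clusterEvent`, `enqueuePlugin`, the two example plugins, `buildEventMap`, and `podsToRequeue` are all hypothetical, and the real `EventsToRegister` takes a context and returns `ClusterEventWithHint` values plus an error.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterEvent is a simplified stand-in for the framework's event type.
type clusterEvent struct {
	resource string // for example "Node" or "PersistentVolumeClaim"
	action   string // for example "Add" or "Update"
}

// enqueuePlugin is a reduced, hypothetical version of EnqueueExtensions.
type enqueuePlugin interface {
	Name() string
	EventsToRegister() []clusterEvent
}

type exampleVolumePlugin struct{}

func (exampleVolumePlugin) Name() string { return "ExampleVolume" }
func (exampleVolumePlugin) EventsToRegister() []clusterEvent {
	return []clusterEvent{{"PersistentVolume", "Add"}, {"PersistentVolumeClaim", "Add"}}
}

type exampleFitPlugin struct{}

func (exampleFitPlugin) Name() string { return "ExampleFit" }
func (exampleFitPlugin) EventsToRegister() []clusterEvent {
	return []clusterEvent{{"Node", "Add"}}
}

// buildEventMap is evaluated once at startup: event -> interested plugins.
func buildEventMap(plugins []enqueuePlugin) map[clusterEvent]map[string]bool {
	m := make(map[clusterEvent]map[string]bool)
	for _, p := range plugins {
		for _, ev := range p.EventsToRegister() {
			if m[ev] == nil {
				m[ev] = make(map[string]bool)
			}
			m[ev][p.Name()] = true
		}
	}
	return m
}

// podsToRequeue returns the unschedulable pods whose rejecting plugin
// registered for ev. rejectedBy maps pod name -> plugin that rejected it.
func podsToRequeue(ev clusterEvent, eventMap map[clusterEvent]map[string]bool, rejectedBy map[string]string) []string {
	var pods []string
	for pod, plugin := range rejectedBy {
		if eventMap[ev][plugin] {
			pods = append(pods, pod)
		}
	}
	sort.Strings(pods) // deterministic output for the example
	return pods
}

func main() {
	eventMap := buildEventMap([]enqueuePlugin{exampleVolumePlugin{}, exampleFitPlugin{}})
	rejectedBy := map[string]string{"pod-a": "ExampleVolume", "pod-b": "ExampleFit"}
	fmt.Println(podsToRequeue(clusterEvent{"Node", "Add"}, eventMap, rejectedBy))
	fmt.Println(podsToRequeue(clusterEvent{"PersistentVolumeClaim", "Add"}, eventMap, rejectedBy))
}
```

The point of the registration is visible in the lookup: a Node event moves only the Pod that a Node-sensitive plugin rejected, and the Pod waiting on a volume stays where it is.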
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit Joohwan when reposting.