kube-scheduler (Overview)

Based on Kubernetes 1.25

kube-scheduler is one of the Kubernetes control plane components. It is responsible for scheduling the Pods of the whole cluster.

kube-scheduler scheduling model

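The kube-scheduler scheduling process can be divided into two cycles and three phases:

  • The two cycles are the scheduling cycle and the binding cycle.
  • The three phases are filtering (Filter), scoring (Score), and binding (Bind).

  • In the scheduling cycle, kube-scheduler works serially and schedules one Pod at a time.
  • In the binding cycle, it works in parallel, so the bindings of different Pods can run concurrently.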
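The split between a serial scheduling cycle and a parallel binding cycle can be illustrated with a small Go sketch. It is not the scheduler's real code: `pickNode`, `run`, and the round-robin node choice are hypothetical placeholders, and the "binding" is only a write to a map.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pickNode stands in for the scheduling cycle of a single Pod.
// Round-robin is a placeholder, not a real scheduling algorithm.
// It assumes nodes is non-empty.
func pickNode(i int, nodes []string) string {
	return nodes[i%len(nodes)]
}

// run schedules Pods one at a time and binds each Pod in its own goroutine.
func run(pods, nodes []string) map[string]string {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	bindings := make(map[string]string, len(pods))
	for i, pod := range pods {
		node := pickNode(i, nodes) // scheduling cycle: serial, one Pod at a time
		wg.Add(1)
		go func(pod, node string) { // binding cycle: runs concurrently
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bindings[pod] = node // placeholder for the real bind call
		}(pod, node)
	}
	wg.Wait()
	return bindings
}

func main() {
	bindings := run([]string{"a", "b", "c"}, []string{"n1", "n2"})
	names := make([]string, 0, len(bindings))
	for name := range bindings {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, bindings[name])
	}
}
```

The loop never waits for a binding to finish before it picks a node for the next Pod, which is the point of separating the two cycles.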

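kube-scheduler can aim for one of two kinds of optimum when it schedules a Pod:

  1. Global optimum: the scheduling policy evaluates every node and picks the best node in the whole cluster.
  2. Local optimum: the scheduling policy evaluates only a subset of the nodes and picks the best node within that subset.
  • kube-scheduler chooses between the two adaptively. By default, when the cluster has 100 nodes or fewer, it evaluates all nodes and therefore finds the global optimum.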
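The adaptive choice can be sketched as a function that decides how many feasible nodes to look for. This sketch is modeled on my recollection of the upstream heuristic. The constants (100 nodes, a 50% base that shrinks by one point per 125 nodes, a 5% floor) and the standalone signature are assumptions and should be checked against the scheduler source for the version in use.

```go
package main

import "fmt"

const (
	// Assumed values, see the lead-in above.
	minFeasibleNodesToFind           = 100 // below this cluster size every node is evaluated
	minFeasibleNodesPercentageToFind = 5   // floor for the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler
// should collect before it stops filtering and starts scoring.
// percentageOfNodesToScore <= 0 means "choose adaptively".
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
		return numAllNodes // small cluster or explicit 100%: global optimum
	}
	adaptive := percentageOfNodesToScore
	if adaptive <= 0 {
		adaptive = 50 - numAllNodes/125 // shrink the share as the cluster grows
		if adaptive < minFeasibleNodesPercentageToFind {
			adaptive = minFeasibleNodesPercentageToFind
		}
	}
	numNodes := numAllNodes * adaptive / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	for _, n := range []int32{50, 100, 1000, 5000} {
		fmt.Printf("%d nodes -> evaluate %d\n", n, numFeasibleNodesToFind(n, 0))
	}
}
```

Under these assumptions a 100-node cluster is still evaluated in full, while a 5000-node cluster stops after 500 feasible nodes, which corresponds to the local optimum described above.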

kube-scheduler internal architecture

The scheduler mainly consists of the Informer, Scheduling Queue, Cache, and Scheduling Framework components.

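The Scheduling Queue buffers the Pods that are waiting to be scheduled, so that the scheduler can efficiently fetch the next Pod to schedule.

  • kube-scheduler's default implementation is PriorityQueue, a priority queue.
  • Internally, the priority queue consists of two sub-queues and one additional map: activeQ, backoffQ, and unschedulablePods.
  • activeQ: a heap-based priority queue. By default, Pods are ordered by priority, and Pods with the same priority are ordered FIFO.
    • Pods in activeQ meet the scheduling requirements and can be scheduled immediately.
    • The main scheduling routine, scheduleOne, keeps taking Pods from activeQ and scheduling them.
    • If there is no Pod to schedule, scheduleOne blocks until one arrives.
  • backoffQ: also a heap-based priority queue, but unlike activeQ it is ordered by the time at which each Pod's backoff period expires.
    • Its Pods meet the scheduling requirements but are still in the backoff period that follows a failed scheduling attempt. Once the backoff period expires, a Pod is moved to activeQ.
    • Backoff exists to mitigate the "indefinite blocking" or "starvation" problem of priority scheduling (high-priority Pods are always scheduled first, while low-priority Pods wait for a long time without being scheduled).
    • It uses exponential backoff: successive failures wait 1s, 2s, 4s, … up to a maximum duration.
  • unschedulablePods: holds the Pods that failed scheduling, backed by a map. When a Pod fails to be scheduled, it is moved to unschedulablePods by default.
    • When the cluster state changes, Pods in unschedulablePods may become schedulable, and they are selectively moved to activeQ or backoffQ.
    • In addition, every 30s, Pods that have stayed in unschedulablePods for more than 5min are automatically moved to activeQ or backoffQ.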
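The two ordering rules described above (priority first, FIFO among equals in activeQ, and exponential backoff for backoffQ) can be sketched as follows. `queuedPod`, `less`, and `backoffDuration` are hypothetical names for illustration. The 1s initial and 10s maximum backoff used in `main` are what I believe the defaults to be, but treat them as assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a hypothetical, simplified queue entry.
type queuedPod struct {
	Name     string
	Priority int32
	Enqueued time.Time // when the Pod entered the queue
	Attempts int       // number of scheduling attempts so far
}

// less orders activeQ: higher priority first, FIFO among equal priorities.
func less(a, b queuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Enqueued.Before(b.Enqueued)
}

// backoffDuration doubles the initial backoff once per attempt after the
// first and caps the result at maxBackoff.
func backoffDuration(attempts int, initial, maxBackoff time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		if d > maxBackoff-d { // doubling would exceed the cap
			return maxBackoff
		}
		d += d
	}
	return d
}

func main() {
	now := time.Now()
	low := queuedPod{Name: "low", Priority: 1, Enqueued: now}
	high := queuedPod{Name: "high", Priority: 10, Enqueued: now.Add(time.Second)}
	fmt.Println("high before low:", less(high, low))

	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

A heap built on `less` always yields the highest-priority, oldest Pod next, and a heap keyed on "last attempt time plus `backoffDuration`" yields the Pod whose backoff expires first.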

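The main purpose of the Cache is to speed up lookups of Pod and node information during scheduling.

Since the Informer mechanism is already in use, and Informers keep a local copy of the data by default, why is a separate Cache still needed?

  • Informers speed up reads of resource objects, but the scheduler still needs scheduling-related data (for example, per-node aggregates) computed in real time while it schedules.
  • To keep scheduling accurate, a snapshot mechanism was also introduced, so that a scheduling cycle works against a consistent view of the Cache.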
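A minimal sketch of why a Cache on top of Informers helps: it keeps per-node aggregates up to date incrementally and hands out an isolated snapshot for each scheduling cycle. All types and methods here (`nodeInfo`, `cache`, `addPod`, `snapshot`) are hypothetical simplifications, not the scheduler's actual data structures.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeInfo holds pre-aggregated data that an Informer store does not provide.
type nodeInfo struct {
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64 // running total over all Pods on the node
	Pods                []string
}

type cache struct {
	mu    sync.Mutex
	nodes map[string]*nodeInfo
}

func newCache() *cache {
	return &cache{nodes: make(map[string]*nodeInfo)}
}

func (c *cache) addNode(name string, allocatableMilliCPU int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[name] = &nodeInfo{AllocatableMilliCPU: allocatableMilliCPU}
}

// addPod updates the node aggregate incrementally instead of recomputing it.
func (c *cache) addPod(node, pod string, milliCPU int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node]
	if !ok {
		return fmt.Errorf("node %q is not in the cache", node)
	}
	n.RequestedMilliCPU += milliCPU
	n.Pods = append(n.Pods, pod)
	return nil
}

// snapshot returns a deep copy, so later cache updates cannot change the
// data that an in-flight scheduling cycle is looking at.
func (c *cache) snapshot() map[string]nodeInfo {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]nodeInfo, len(c.nodes))
	for name, n := range c.nodes {
		cp := *n
		cp.Pods = append([]string(nil), n.Pods...)
		out[name] = cp
	}
	return out
}

func main() {
	c := newCache()
	c.addNode("n1", 4000)
	if err := c.addPod("n1", "a", 500); err != nil {
		fmt.Println(err)
	}
	snap := c.snapshot()
	if err := c.addPod("n1", "b", 1500); err != nil {
		fmt.Println(err)
	}
	fmt.Println("snapshot:", snap["n1"].RequestedMilliCPU)
	fmt.Println("live cache:", c.snapshot()["n1"].RequestedMilliCPU)
}
```

This sketch copies everything on each snapshot for simplicity. A production implementation would more likely update the snapshot incrementally.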

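The Scheduling Framework is the key piece that ties the scheduling algorithms together. It defines the scheduling process as a series of extension points, and plugins that implement specific scheduling algorithms can be registered at each extension point.

  • The Scheduling Framework is driven by a dedicated goroutine that runs for the lifetime of the scheduler.

  • It supports the Assume mechanism: once a node has been selected for a Pod, the Cache state is updated immediately, without waiting for the binding to complete.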
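The idea of extension points with registered plugins can be sketched with two of them, Filter and Score. The interfaces below are deliberately simplified and hypothetical. The real framework interfaces take a context, cycle state, and node info, and return status objects. `fitsCPU` and `mostFreeCPU` are made-up plugins.

```go
package main

import (
	"errors"
	"fmt"
)

type node struct {
	Name    string
	FreeCPU int64
}

type pod struct {
	Name string
	CPU  int64
}

// filterPlugin and scorePlugin model two extension points.
type filterPlugin interface {
	Name() string
	Filter(p pod, n node) bool
}

type scorePlugin interface {
	Name() string
	Score(p pod, n node) int64
}

// framework holds the plugins registered at each extension point.
type framework struct {
	filters []filterPlugin
	scorers []scorePlugin
}

func (f *framework) passesFilters(p pod, n node) bool {
	for _, pl := range f.filters {
		if !pl.Filter(p, n) {
			return false
		}
	}
	return true
}

// selectNode runs every Filter plugin, sums the Score plugins for the
// nodes that pass, and returns the node with the highest total score.
func (f *framework) selectNode(p pod, nodes []node) (string, error) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		if !f.passesFilters(p, n) {
			continue
		}
		var total int64
		for _, s := range f.scorers {
			total += s.Score(p, n)
		}
		if !found || total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	if !found {
		return "", errors.New("no feasible node")
	}
	return best, nil
}

// fitsCPU is a hypothetical Filter plugin.
type fitsCPU struct{}

func (fitsCPU) Name() string              { return "FitsCPU" }
func (fitsCPU) Filter(p pod, n node) bool { return n.FreeCPU >= p.CPU }

// mostFreeCPU is a hypothetical Score plugin.
type mostFreeCPU struct{}

func (mostFreeCPU) Name() string              { return "MostFreeCPU" }
func (mostFreeCPU) Score(p pod, n node) int64 { return n.FreeCPU - p.CPU }

func main() {
	f := &framework{
		filters: []filterPlugin{fitsCPU{}},
		scorers: []scorePlugin{mostFreeCPU{}},
	}
	nodes := []node{{"n1", 300}, {"n2", 1000}, {"n3", 2000}}
	name, err := f.selectNode(pod{"web", 500}, nodes)
	fmt.Println(name, err)
}
```

Adding a new algorithm then means registering one more plugin at the relevant extension point, without touching `selectNode`.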

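kube-scheduler event handling

Kubernetes components interact through events. The scheduler's events come from Informers, and they are registered in two ways: built-in resource events that are watched by default, and resource events that plugins choose to watch.

  • Built-in default resource events: events on the core resource objects that the scheduler always cares about, namely Pod and Node events. The scheduling framework registers these by default.
  • Plugin-defined resource events: events tied to a plugin's scheduling algorithm, registered through a method that the plugin declares. For example, VolumeBinding needs to watch changes to PV/PVC resources.

To support events on extended resource objects, kube-scheduler uses, in addition to the standard SharedInformer, a DynamicSharedInformer that supports dynamic types.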

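By default, kube-scheduler automatically registers event handlers for Pods and Nodes:

  1. For Pod events:
    • The Pod is already scheduled (spec.nodeName is not empty): first update the Cache so that the cached state stays consistent, then find, among the Pods that failed scheduling, those that have affinity with this Pod and trigger scheduling for them again.
    • The Pod is not yet scheduled: update the Scheduling Queue so that the Pod can be scheduled. If an unscheduled Pod is deleted while Assumed Pods (Pods that have passed the scheduling cycle but have not finished the binding cycle) are waiting for it to be scheduled, the scheduler releases the resources held by those Assumed Pods and retries scheduling for the Pods that have not been scheduled successfully.
  2. For Node events: update the Cache to keep it consistent, and retry scheduling for the Pods that may have become schedulable because of the Node change.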
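The routing of the default Pod and Node events can be sketched as below. This is a hypothetical simplification: `scheduler`, `onPodAdd`, and `onNodeAdd` are illustrative names, the Cache is reduced to a map, and every unschedulable Pod is retried on any Node event, whereas the real scheduler is more selective.

```go
package main

import "fmt"

// podEvent is a simplified Pod add event.
type podEvent struct {
	Name     string
	NodeName string // mirrors spec.nodeName; empty means not yet scheduled
}

type scheduler struct {
	cachedPods    map[string]string // Cache: Pod name -> node name
	queue         []string          // Pods ready to be scheduled
	unschedulable []string          // Pods that failed scheduling earlier
}

func newScheduler() *scheduler {
	return &scheduler{cachedPods: make(map[string]string)}
}

// onPodAdd routes a Pod event: scheduled Pods update the Cache,
// unscheduled Pods go to the scheduling queue.
func (s *scheduler) onPodAdd(e podEvent) string {
	if e.NodeName != "" {
		s.cachedPods[e.Name] = e.NodeName
		return "cache"
	}
	s.queue = append(s.queue, e.Name)
	return "queue"
}

// onNodeAdd gives previously unschedulable Pods another chance and
// reports how many Pods were moved back to the queue.
func (s *scheduler) onNodeAdd() int {
	moved := len(s.unschedulable)
	s.queue = append(s.queue, s.unschedulable...)
	s.unschedulable = nil
	return moved
}

func main() {
	s := newScheduler()
	s.unschedulable = []string{"stuck"}
	fmt.Println(s.onPodAdd(podEvent{Name: "running", NodeName: "n1"}))
	fmt.Println(s.onPodAdd(podEvent{Name: "pending"}))
	fmt.Println("moved on node event:", s.onNodeAdd())
	fmt.Println("queue:", s.queue)
}
```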

The Scheduling Framework provides an open interface through which a plugin declares the resource object types and events it cares about.

  • Ref: https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/scheduler/framework/interface.go#L482

    // EnqueueExtensions is an optional interface that plugins can implement to efficiently
    // move unschedulable Pods in internal scheduling queues.
    // In the scheduler, Pods can be unschedulable by PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins,
    // and Pods rejected by these plugins are requeued based on this extension point.
    // Failures from other extension points are regarded as temporal errors (e.g., network failure),
    // and the scheduler requeue Pods without this extension point - always requeue Pods to activeQ after backoff.
    // This is because such temporal errors cannot be resolved by specific cluster events,
    // and we have no choose but keep retrying scheduling until the failure is resolved.
    //
    // Plugins that make pod unschedulable (PreEnqueue, PreFilter, Filter, Reserve, and Permit plugins) should implement this interface,
    // otherwise the default implementation will be used, which is less efficient in requeueing Pods rejected by the plugin.
    // And, if plugins other than above extension points support this interface, they are just ignored.
    type EnqueueExtensions interface {
        Plugin
        // EventsToRegister returns a series of possible events that may cause a Pod
        // failed by this plugin schedulable. Each event has a callback function that
        // filters out events to reduce useless retry of Pod's scheduling.
        // The events will be registered when instantiating the internal scheduling queue,
        // and leveraged to build event handlers dynamically.
        // When it returns an error, the scheduler fails to start.
        // Note: the returned list needs to be determined at a startup,
        // and the scheduler only evaluates it once during start up.
        // Do not change the result during runtime, for example, based on the cluster's state etc.
        //
        // Appropriate implementation of this function will make Pod's re-scheduling accurate and performant.
        EventsToRegister(context.Context) ([]ClusterEventWithHint, error)
    }
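To show how a plugin might satisfy the interface quoted above, here is a self-contained sketch. The types `ActionType`, `ClusterEvent`, `ClusterEventWithHint`, and `Plugin` are simplified local stand-ins written for this example. They are not the real definitions from the scheduler framework package, which carry more fields. `volumeAware` and `registeredResources` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the framework types (assumptions, see above).
type ActionType int

const (
	Add ActionType = 1 << iota
	Delete
	Update
)

type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

type ClusterEventWithHint struct {
	Event ClusterEvent
}

type Plugin interface {
	Name() string
}

type EnqueueExtensions interface {
	Plugin
	EventsToRegister(context.Context) ([]ClusterEventWithHint, error)
}

// volumeAware is a hypothetical plugin whose rejections can only be
// resolved by PV or PVC changes, so those are the events it registers.
type volumeAware struct{}

func (volumeAware) Name() string { return "VolumeAware" }

func (volumeAware) EventsToRegister(context.Context) ([]ClusterEventWithHint, error) {
	// The list is fixed: it must not depend on the cluster state at runtime.
	return []ClusterEventWithHint{
		{Event: ClusterEvent{Resource: "PersistentVolume", ActionType: Add | Update}},
		{Event: ClusterEvent{Resource: "PersistentVolumeClaim", ActionType: Add | Update}},
	}, nil
}

// Compile-time check that volumeAware satisfies the stand-in interface.
var _ EnqueueExtensions = volumeAware{}

// registeredResources lists the resource kinds a plugin asks to watch.
func registeredResources(p EnqueueExtensions) ([]string, error) {
	events, err := p.EventsToRegister(context.Background())
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(events))
	for _, e := range events {
		names = append(names, e.Event.Resource)
	}
	return names, nil
}

func main() {
	names, err := registeredResources(volumeAware{})
	fmt.Println(names, err)
}
```

With such a declaration, Pods rejected by this plugin only need to be reconsidered when one of the listed events occurs, instead of on every cluster change.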