Kube-controller-manager (NodeLifecycleController)

Based on Kubernetes 1.25

In Kubernetes, a Pod is scheduled onto a Node and runs there. When a Node becomes unhealthy, the cluster needs to mark the Node as unhealthy and evict the Pods on it so that they can run on other Nodes.

  • The NodeLifecycle Controller watches Node status and adjusts the Node's Taints based on that status

The NodeLifecycle Controller has an --enable-taint-manager startup flag. It is enabled by default and was removed in 1.27:

  • If true: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods on the Node by adding NoExecute Taints to it (as sketched below)
  • If false: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods directly, without adding a Taint
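
For reference, the taint-based path revolves around two NoExecute Taints. The sketch below only constructs them with the k8s.io/api/core/v1 types; the variable names are ours, not from the controller source:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    func main() {
        // Applied when the Node's Ready Condition is False.
        notReady := v1.Taint{Key: "node.kubernetes.io/not-ready", Effect: v1.TaintEffectNoExecute}
        // Applied when the Node's Ready Condition is Unknown.
        unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}
        fmt.Printf("%v\n%v\n", notReady, unreachable)
    }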

Controller initialization

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L261


    // Controller is the controller that manages node's life cycle.
    type Controller struct {
        taintManager *scheduler.NoExecuteTaintManager

        podLister corelisters.PodLister
        podInformerSynced cache.InformerSynced
        kubeClient clientset.Interface

        // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
        // to avoid the problem with time skew across the cluster.
        now func() metav1.Time

        enterPartialDisruptionFunc func(nodeNum int) float32
        enterFullDisruptionFunc func(nodeNum int) float32
        computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
        // Caches all Nodes in the cluster, keyed by Node name.
        knownNodeSet map[string]*v1.Node
        // per Node map storing last observed health together with a local time when it was observed.
        // Caches the health of every Node in the cluster. This is a lock-protected map keyed by
        // Node name; the value records the Node's last observed health and when it was observed.
        nodeHealthMap *nodeHealthMap

        // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
        evictorLock sync.Mutex
        // Caches the Pod eviction state of the Nodes in the cluster. This is a lock-protected map;
        // the key is the Node name and the value records the Node's Pod eviction state.
        // There are three eviction states:
        // 1. unmarked: no eviction needed
        // 2. toBeEvicted: eviction needed but not yet performed
        // 3. evicted: Pod eviction has been performed
        // This field is only used when runTaintManager is false.
        nodeEvictionMap *nodeEvictionMap
        // workers that evicts pods from unresponsive nodes.
        // Per-zone eviction queues recording the Nodes whose Pods need to be evicted.
        zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
        // workers that are responsible for tainting nodes.
        zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

        nodesToRetry sync.Map

        zoneStates map[string]ZoneState

        daemonSetStore appsv1listers.DaemonSetLister
        daemonSetInformerSynced cache.InformerSynced

        leaseLister coordlisters.LeaseLister
        leaseInformerSynced cache.InformerSynced
        nodeLister corelisters.NodeLister
        nodeInformerSynced cache.InformerSynced

        getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)

        broadcaster record.EventBroadcaster
        recorder record.EventRecorder

        // Value controlling Controller monitoring period, i.e. how often does Controller
        // check node health signal posted from kubelet. This value should be lower than
        // nodeMonitorGracePeriod.
        // TODO: Change node health monitor to watch based.
        // How often the NodeLifecycle Controller checks the node health signals reported by the kubelet.
        nodeMonitorPeriod time.Duration

        // When node is just created, e.g. cluster bootstrap or node creation, we give
        // a longer grace period.
        nodeStartupGracePeriod time.Duration

        // Controller will not proactively sync node health, but will monitor node
        // health signal updated from kubelet. There are 2 kinds of node healthiness
        // signals: NodeStatus and NodeLease. If it doesn't receive update for this amount
        // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
        // time before which Controller start evicting pods is controlled via flag
        // 'pod-eviction-timeout'.
        // Note: be cautious when changing the constant, it must work with
        // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
        // controller. The node health signal update frequency is the minimal of the
        // two.
        // There are several constraints:
        // 1. nodeMonitorGracePeriod must be N times more than the node health signal
        // update frequency, where N means number of retries allowed for kubelet to
        // post node status/lease. It is pointless to make nodeMonitorGracePeriod
        // be less than the node health signal update frequency, since there will
        // only be fresh values from Kubelet at an interval of node health signal
        // update frequency. The constant must be less than podEvictionTimeout.
        // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
        // value takes longer for user to see up-to-date node health.
        nodeMonitorGracePeriod time.Duration

        // How long to wait before evicting Pods from an unhealthy Node.
        podEvictionTimeout time.Duration
        evictionLimiterQPS float32
        secondaryEvictionLimiterQPS float32
        largeClusterThreshold int32
        unhealthyZoneThreshold float32

        // if set to true Controller will start TaintManager that will evict Pods from
        // tainted nodes, if they're not tolerated.
        // Corresponds to the --enable-taint-manager startup flag described above.
        runTaintManager bool
        // Work queues holding the node and pod events picked up from the informers.
        nodeUpdateQueue workqueue.Interface
        podUpdateQueue workqueue.RateLimitingInterface
    }

Main execution flow

The NodeLifecycle Controller's startup mainly launches six kinds of goroutines:

  • nc.taintManager.Run: started only when runTaintManager is true. When NoExecute Taints appear on a Node, it evicts every Pod on that Node that cannot tolerate them

  • nc.doNodeProcessingPassWorker: when a Node's Conditions report a problem, it adds NoSchedule Taints to the Node so that no new Pods are scheduled onto it

  • nc.doPodProcessingWorker: checks the status of the Node a Pod is bound to. When the Node's Ready Condition is false or unknown, it marks the Pod's Ready Condition as false; if runTaintManager is false, it also sends the Node down the direct-eviction path

  • nc.doNoExecuteTaintingPass: started only when runTaintManager is true.

    For each Node in the eviction queue, based on the Node's health, it adds the corresponding NoExecute Taint when the Node's Ready Condition is false or unknown

  • nc.doEvictionPass: started only when runTaintManager is false.

    For each Node in the eviction queue, based on the Node's health, it evicts the Pods on the Node when the Node's Ready Condition is false or unknown

  • nc.monitorNodeHealth: watches the health of every Node in the cluster, updates the Nodes' Conditions, caches the Node state inside the NodeLifecycle Controller, and adds unhealthy Nodes to the eviction queue. This goroutine is a periodic loop whose interval is determined by nodeMonitorPeriod

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L527

    // Run starts an asynchronous loop that monitors the status of cluster nodes.
    func (nc *Controller) Run(ctx context.Context) {
        defer utilruntime.HandleCrash()

        // Start events processing pipeline.
        nc.broadcaster.StartStructuredLogging(0)
        klog.Infof("Sending events to api server.")
        nc.broadcaster.StartRecordingToSink(
            &v1core.EventSinkImpl{
                Interface: v1core.New(nc.kubeClient.CoreV1().RESTClient()).Events(""),
            })
        defer nc.broadcaster.Shutdown()

        // Close node update queue to cleanup go routine.
        defer nc.nodeUpdateQueue.ShutDown()
        defer nc.podUpdateQueue.ShutDown()

        klog.Infof("Starting node controller")
        defer klog.Infof("Shutting down node controller")

        if !cache.WaitForNamedCacheSync("taint", ctx.Done(), nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
            return
        }

        if nc.runTaintManager {
            go nc.taintManager.Run(ctx)
        }

        // Start workers to reconcile labels and/or update NoSchedule taint for nodes.
        for i := 0; i < scheduler.UpdateWorkerSize; i++ {
            // Thanks to "workqueue", each worker just need to get item from queue, because
            // the item is flagged when got from queue: if new event come, the new item will
            // be re-queued until "Done", so no more than one worker handle the same item and
            // no event missed.
            go wait.UntilWithContext(ctx, nc.doNodeProcessingPassWorker, time.Second)
        }

        for i := 0; i < podUpdateWorkerSize; i++ {
            go wait.UntilWithContext(ctx, nc.doPodProcessingWorker, time.Second)
        }

        if nc.runTaintManager {
            // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
            // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
            go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
        } else {
            // Managing eviction of nodes:
            // When we delete pods off a node, if the node was not empty at the time we then
            // queue an eviction watcher. If we hit an error, retry deletion.
            go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
        }

        // Incorporate the results of node health signal pushed from kubelet to master.
        go wait.UntilWithContext(ctx, func(ctx context.Context) {
            if err := nc.monitorNodeHealth(ctx); err != nil {
                klog.Errorf("Error monitoring node health: %v", err)
            }
        }, nc.nodeMonitorPeriod)

        <-ctx.Done()
    }

When runTaintManager is true, the Pod eviction flow after a Node becomes unhealthy is as follows:

  1. nc.monitorNodeHealth: detects that the Node's Ready Condition is false or unknown and adds the Node to the zoneNoExecuteTainter eviction queue
  2. nc.doNoExecuteTaintingPass: iterates over the zoneNoExecuteTainter eviction queue and, when a Node's Ready Condition is false or unknown, applies the NoExecute Taint
  3. nc.taintManager.Run: observes the NoExecute Taint on the Node and evicts every Pod that cannot tolerate it (see the toleration sketch below)
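
Whether, and how quickly, a Pod is evicted in step 3 depends on its tolerations. The sketch below builds a toleration that lets a Pod survive the node.kubernetes.io/unreachable NoExecute Taint for 60 seconds before the Taint Manager evicts it; it is an illustrative fragment using the k8s.io/api types, not controller code:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    func main() {
        sixty := int64(60)
        toleration := v1.Toleration{
            // Tolerate the unreachable NoExecute Taint for 60s: the Taint Manager waits
            // TolerationSeconds before evicting the Pod from the tainted Node.
            Key:               "node.kubernetes.io/unreachable",
            Operator:          v1.TolerationOpExists,
            Effect:            v1.TaintEffectNoExecute,
            TolerationSeconds: &sixty,
        }
        fmt.Printf("%+v\n", toleration)
    }

By default the DefaultTolerationSeconds admission plugin adds tolerations of this shape for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with a 300-second window, which is why Pods typically stay on an unreachable Node for about five minutes before being evicted.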

When runTaintManager is false, the following happens after a Node becomes unhealthy:

  1. nc.monitorNodeHealth: detects that the Node's Ready Condition is false or unknown and adds the Node to the zonePodEvictor eviction queue
  2. nc.doPodProcessingWorker: when the Node a Pod is bound to has a Ready Condition of false or unknown, also adds that Node to the zonePodEvictor eviction queue
  3. nc.doEvictionPass: iterates over zonePodEvictor and evicts the Pods on each queued Node (its per-Node bookkeeping is sketched below)
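
The per-Node bookkeeping for this path lives in nodeEvictionMap (see the struct above). The sketch below models the three eviction states with simplified types of our own to make the transitions explicit; it is not the controller's implementation:

    package main

    import (
        "fmt"
        "sync"
    )

    // evictionStatus mirrors the three per-Node states: unmarked -> toBeEvicted -> evicted.
    type evictionStatus int

    const (
        unmarked    evictionStatus = iota // no eviction needed
        toBeEvicted                       // queued for eviction, not yet performed
        evicted                           // Pod eviction has been performed
    )

    // evictionMap is a simplified lock-protected map from Node name to eviction state.
    type evictionMap struct {
        sync.Mutex
        states map[string]evictionStatus
    }

    func (m *evictionMap) setStatus(nodeName string, s evictionStatus) {
        m.Lock()
        defer m.Unlock()
        m.states[nodeName] = s
    }

    func main() {
        m := &evictionMap{states: map[string]evictionStatus{}}
        m.setStatus("node-1", toBeEvicted) // monitorNodeHealth queues the Node
        m.setStatus("node-1", evicted)     // doEvictionPass finishes deleting its Pods
        fmt.Println(m.states["node-1"])
    }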

Adding NoSchedule Taints

The NodeLifecycle Controller maintains the NoSchedule-effect Taints on each Node according to the Node's Conditions, mainly through nc.doNodeProcessingPassWorker; the core logic is in doNoScheduleTaintingPass (a sketch of the condition-to-taint mapping follows the list):

  1. nc.nodeLister.Get

    Get the Node object

  2. for _, condition := range node.Status.Conditions

    Add Taints according to the Node's Conditions:

    1. Ready Condition
      • false: add the node.kubernetes.io/not-ready Taint
      • unknown: add the node.kubernetes.io/unreachable Taint
    2. MemoryPressure Condition
      • true: add the node.kubernetes.io/memory-pressure Taint
    3. DiskPressure Condition
      • true: add the node.kubernetes.io/disk-pressure Taint
    4. PIDPressure Condition
      • true: add the node.kubernetes.io/pid-pressure Taint
    5. NetworkUnavailable Condition
      • true: add the node.kubernetes.io/network-unavailable Taint
  3. node.Spec.Unschedulable

    Add the unschedulable Taint according to node.Spec.Unschedulable

  4. SwapNodeControllerTaint

    Update the Taints on the Node
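
Conceptually, doNoScheduleTaintingPass applies a condition-to-taint-key mapping like the one below. This is our own reconstruction for illustration; the taint keys are the standard ones listed above and every taint here carries the NoSchedule effect:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // nodeConditionToTaintKey maps an abnormal Node condition (type + status)
    // to the NoSchedule taint key that should be present on the Node.
    var nodeConditionToTaintKey = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
        v1.NodeReady: {
            v1.ConditionFalse:   "node.kubernetes.io/not-ready",
            v1.ConditionUnknown: "node.kubernetes.io/unreachable",
        },
        v1.NodeMemoryPressure:     {v1.ConditionTrue: "node.kubernetes.io/memory-pressure"},
        v1.NodeDiskPressure:       {v1.ConditionTrue: "node.kubernetes.io/disk-pressure"},
        v1.NodePIDPressure:        {v1.ConditionTrue: "node.kubernetes.io/pid-pressure"},
        v1.NodeNetworkUnavailable: {v1.ConditionTrue: "node.kubernetes.io/network-unavailable"},
    }

    // desiredNoScheduleTaints computes the NoSchedule taints a Node should carry
    // given its current Conditions and spec.unschedulable.
    func desiredNoScheduleTaints(node *v1.Node) []v1.Taint {
        var taints []v1.Taint
        for _, cond := range node.Status.Conditions {
            if key, ok := nodeConditionToTaintKey[cond.Type][cond.Status]; ok {
                taints = append(taints, v1.Taint{Key: key, Effect: v1.TaintEffectNoSchedule})
            }
        }
        if node.Spec.Unschedulable {
            taints = append(taints, v1.Taint{Key: "node.kubernetes.io/unschedulable", Effect: v1.TaintEffectNoSchedule})
        }
        return taints
    }

    func main() {
        node := &v1.Node{
            Spec: v1.NodeSpec{Unschedulable: true},
            Status: v1.NodeStatus{Conditions: []v1.NodeCondition{
                {Type: v1.NodeReady, Status: v1.ConditionUnknown},
            }},
        }
        fmt.Println(desiredNoScheduleTaints(node))
    }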

Node health checking

nc.monitorNodeHealth adds unhealthy Nodes to the eviction queue based on each Node's status (a sketch of the staleness check follows the list):

  1. nc.nodeLister.List

    List all Nodes in the cluster

    This runs periodically, checking every Node in the cluster on each pass

  2. nc.tryUpdateNodeHealth

    Check and update the Node's latest Conditions

  3. nc.getPodsAssignedToNode

    Get the Pods on the Node

  4. nc.processTaintBaseEviction / nc.processNoTaintBaseEviction

    Based on the Node's state, add it to the appropriate eviction queue
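
The core of tryUpdateNodeHealth is a staleness check: if neither a NodeStatus update nor a Lease renewal has been observed within nodeMonitorGracePeriod, the Ready Condition is set to Unknown. The sketch below captures just that timing rule; the function and variable names are ours, not the controller's:

    package main

    import (
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // readyStale reports whether the Node's health signal is stale, i.e. the last
    // observed heartbeat is older than the grace period. When this returns true,
    // the controller sets the Ready Condition to Unknown.
    func readyStale(lastHeartbeat metav1.Time, gracePeriod time.Duration, now metav1.Time) bool {
        return now.Time.After(lastHeartbeat.Time.Add(gracePeriod))
    }

    func main() {
        now := metav1.Now()
        lastHeartbeat := metav1.NewTime(now.Time.Add(-50 * time.Second))
        // With the default nodeMonitorGracePeriod of 40s, a 50s-old heartbeat is stale.
        fmt.Println(readyStale(lastHeartbeat, 40*time.Second, now))
    }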

Evicting Pods from a Node with NoExecute Taints

If the Taint Manager is running, the nc.doNoExecuteTaintingPass goroutine adds NoExecute Taints to unhealthy Nodes, and the nc.taintManager.Run goroutine then evicts the Pods that cannot tolerate those Taints.

nc.doNoExecuteTaintingPass

  1. zoneNoExecuteTainterKeys

    Get the names of all zones

  2. zoneNoExecuteTainterWorker.Try

    Add Taints to each Node in each zone

    1. nc.nodeLister.Get

      Get the Node object

    2. GetNodeCondition

      Get the Node's current Ready Condition

    3. switch condition.Status

      • Ready Condition false: add the node.kubernetes.io/not-ready Taint
      • Ready Condition unknown: add the node.kubernetes.io/unreachable Taint
      • Both Taints have the NoExecute effect
    4. SwapNodeControllerTaint

      Call kube-apiserver to update the Taints (sketched below)
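
The end result of SwapNodeControllerTaint is that the Node object carries the NoExecute Taint matching its current Ready Condition. A simplified client-go sketch of that update is below; it fetches the Node, appends the desired Taint if missing, and writes the Node back. The function name is ours; removing the opposite Taint and handling update conflicts are omitted:

    package main

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // ensureNoExecuteTaint adds the given NoExecute taint to the Node if it is not
    // already present, then updates the Node through the API server.
    func ensureNoExecuteTaint(ctx context.Context, client kubernetes.Interface, nodeName string, taint v1.Taint) error {
        node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, t := range node.Spec.Taints {
            if t.Key == taint.Key && t.Effect == taint.Effect {
                return nil // already tainted
            }
        }
        node.Spec.Taints = append(node.Spec.Taints, taint)
        _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }

    func main() {
        // Running this requires a live clientset (e.g. built from kubeconfig);
        // shown only to make the intended call shape explicit.
        _ = ensureNoExecuteTaint
    }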

nc.taintManager.Run

  1. tc.nodeLister.Get

    Get the Node object

  2. getNoExecuteTaints

    Get all NoExecute Taints on the Node

  3. tc.getPodsAssignedToNode

    Get the Pods on the Node

  4. tc.processPodOnNode

    Evict the Pods that cannot tolerate the Taints (a toleration-check sketch follows)
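
Step 4 boils down to checking whether the Pod tolerates every NoExecute Taint on the Node; if not, the Pod is deleted, possibly after a tolerationSeconds delay. A minimal check using the Toleration.ToleratesTaint helper from k8s.io/api is sketched below; the surrounding function is ours:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // toleratesAll reports whether the Pod tolerates every NoExecute taint on the
    // Node; Pods for which this returns false are candidates for eviction.
    func toleratesAll(pod *v1.Pod, noExecuteTaints []v1.Taint) bool {
        for i := range noExecuteTaints {
            tolerated := false
            for _, tol := range pod.Spec.Tolerations {
                if tol.ToleratesTaint(&noExecuteTaints[i]) {
                    tolerated = true
                    break
                }
            }
            if !tolerated {
                return false
            }
        }
        return true
    }

    func main() {
        unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}
        pod := &v1.Pod{} // a Pod with no tolerations at all
        fmt.Println(toleratesAll(pod, []v1.Taint{unreachable})) // false -> evict
    }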

Evicting Pods from a Node directly

If the Taint Manager is not running, the Pods on an unhealthy Node are evicted directly by nc.doEvictionPass:

  1. zonePodEvictorKeys

    Get the names of all zones

  2. zonePodEvictionWorker.Try

    Evict the Pods on each Node in each zone (see the deletion sketch after this list):

    1. nc.nodeLister.Get

      Get the Node

    2. nc.getPodsAssignedToNode

      Get the Pods on the Node

    3. controllerutil.DeletePods

      Evict (delete) the Pods

    4. nc.nodeEvictionMap.setStatus

      Set the Node's eviction state to evicted
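
The effect of step 3 is to delete the Pods bound to the Node through the API server (the real controllerutil.DeletePods additionally skips DaemonSet-managed Pods and Pods that are already terminating, and records eviction events). A simplified client-go sketch of that deletion step is below; the function name is ours and retries are omitted:

    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // deletePodsOnNode deletes every Pod bound to nodeName, which is what the
    // direct-eviction path ultimately does for an unhealthy Node.
    func deletePodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
        pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            return err
        }
        for _, pod := range pods.Items {
            if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        // Running this requires a live clientset; shown only for the call shape.
        _ = deletePodsOnNode
    }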