


  • The NodeLifecycle Controller watches Node status and adjusts each Node's Taints accordingly

The NodeLifecycle Controller has a --enable-taint-manager startup flag. It is enabled by default and was removed in 1.27.

  • If true: when a Node becomes abnormal, the NodeLifecycle Controller evicts the Pods on it by adding a NoExecute Taint to the Node
  • If false: when a Node becomes abnormal, the NodeLifecycle Controller evicts the Pods directly and adds no Taint


    // Controller is the controller that manages node's life cycle.
    type Controller struct {
        taintManager *scheduler.NoExecuteTaintManager

        podLister corelisters.PodLister
        podInformerSynced cache.InformerSynced
        kubeClient clientset.Interface

        // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
        // to avoid the problem with time skew across the cluster.
        now func() metav1.Time

        enterPartialDisruptionFunc func(nodeNum int) float32
        enterFullDisruptionFunc func(nodeNum int) float32
        computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
        // Caches every Node in the cluster, keyed by Node name.
        knownNodeSet map[string]*v1.Node
        // per Node map storing last observed health together with a local time when it was observed.
        // Caches the health of every Node in the cluster. It is a lock-protected map: the key is
        // the Node name, and the value records the Node's last observed health and the local time
        // of that observation.
        nodeHealthMap *nodeHealthMap

        // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
        evictorLock sync.Mutex
        // Caches the Pod eviction status of every Node in the cluster. It is a lock-protected map:
        // the key is the Node name, and the value is the Pod eviction status of that Node.
        // There are three eviction statuses:
        // 1. unmarked: no eviction needed
        // 2. toBeEvicted: eviction needed but not yet performed
        // 3. evicted: Pod eviction has already been performed
        // This field is only used when runTaintManager is false.
        nodeEvictionMap *nodeEvictionMap
        // workers that evicts pods from unresponsive nodes.
        // Per-zone eviction queues recording the Nodes whose Pods need to be evicted.
        zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
        // workers that are responsible for tainting nodes.
        zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

        nodesToRetry sync.Map

        zoneStates map[string]ZoneState

        daemonSetStore appsv1listers.DaemonSetLister
        daemonSetInformerSynced cache.InformerSynced

        leaseLister coordlisters.LeaseLister
        leaseInformerSynced cache.InformerSynced
        nodeLister corelisters.NodeLister
        nodeInformerSynced cache.InformerSynced

        getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)

        broadcaster record.EventBroadcaster
        recorder record.EventRecorder

        // Value controlling Controller monitoring period, i.e. how often does Controller
        // check node health signal posted from kubelet. This value should be lower than
        // nodeMonitorGracePeriod.
        // TODO: Change node health monitor to watch based.
        // This is the interval at which the monitorNodeHealth loop runs.
        nodeMonitorPeriod time.Duration

        // When node is just created, e.g. cluster bootstrap or node creation, we give
        // a longer grace period.
        nodeStartupGracePeriod time.Duration

        // Controller will not proactively sync node health, but will monitor node
        // health signal updated from kubelet. There are 2 kinds of node healthiness
        // signals: NodeStatus and NodeLease. If it doesn't receive update for this amount
        // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
        // time before which Controller start evicting pods is controlled via flag
        // 'pod-eviction-timeout'.
        // Note: be cautious when changing the constant, it must work with
        // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
        // controller. The node health signal update frequency is the minimal of the
        // two.
        // There are several constraints:
        // 1. nodeMonitorGracePeriod must be N times more than the node health signal
        // update frequency, where N means number of retries allowed for kubelet to
        // post node status/lease. It is pointless to make nodeMonitorGracePeriod
        // be less than the node health signal update frequency, since there will
        // only be fresh values from Kubelet at an interval of node health signal
        // update frequency. The constant must be less than podEvictionTimeout.
        // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
        // value takes longer for user to see up-to-date node health.
        // Node health timeout: if the time since the kubelet last reported the Node's
        // status exceeds this value, the NodeLifecycle Controller sets the Node's
        // Condition to unknown.
        nodeMonitorGracePeriod time.Duration
        // How long to wait before evicting Pods.
        podEvictionTimeout time.Duration
        evictionLimiterQPS float32
        secondaryEvictionLimiterQPS float32
        largeClusterThreshold int32
        unhealthyZoneThreshold float32

        // if set to true Controller will start TaintManager that will evict Pods from
        // tainted nodes, if they're not tolerated.
        // Corresponds to the `--enable-taint-manager` startup flag described earlier.
        runTaintManager bool
        // Work queues for Node and Pod update events.
        nodeUpdateQueue workqueue.Interface
        podUpdateQueue workqueue.RateLimitingInterface
    }
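
The comments above describe nodeHealthMap as a lock-protected map keyed by Node name. As a rough illustration of that pattern only, the sketch below uses simplified, hypothetical types (healthData, healthMap) rather than the controller's real ones.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthData is a hypothetical stand-in for the per-node record:
// the last observed Ready status plus the local time it was observed.
type healthData struct {
	readyStatus    string
	probeTimestamp time.Time
}

// healthMap is a map guarded by a RWMutex and keyed by node name.
type healthMap struct {
	lock  sync.RWMutex
	nodes map[string]healthData
}

func newHealthMap() *healthMap {
	return &healthMap{nodes: make(map[string]healthData)}
}

// get returns a copy of the record for name and whether it exists.
func (m *healthMap) get(name string) (healthData, bool) {
	m.lock.RLock()
	defer m.lock.RUnlock()
	d, ok := m.nodes[name]
	return d, ok
}

// set stores the record for name.
func (m *healthMap) set(name string, d healthData) {
	m.lock.Lock()
	defer m.lock.Unlock()
	m.nodes[name] = d
}

func main() {
	m := newHealthMap()
	m.set("node-1", healthData{readyStatus: "True", probeTimestamp: time.Now()})
	d, ok := m.get("node-1")
	fmt.Println(d.readyStatus, ok)
}
```

Because the monitoring loop and the event workers run in separate goroutines, every read and write goes through the lock.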


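On startup, the NodeLifecycle Controller mainly launches six kinds of goroutines:

  • nc.taintManager.Run: started only when runTaintManager is true. When a NoExecute Taint appears on a Node, it evicts every Pod on that Node that does not tolerate the Taint

  • nc.doNodeProcessingPassWorker: when a Node's Conditions become abnormal, adds NoSchedule Taints to the Node so that no new Pods are scheduled onto it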

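  • nc.doPodProcessingWorker: checks the state of the Node that a Pod runs on. When the Node's Ready Condition is false or unknown and runTaintManager is false, the Node is added to the eviction queue; in either mode, the Pod's Ready Condition is updated to false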

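  • nc.doNoExecuteTaintingPass: started only when runTaintManager is true.

    For each Node in the eviction queue, it checks the Node's health and, if the Ready Condition is false or unknown, adds the corresponding NoExecute Taint to the Node

  • nc.doEvictionPass: started only when runTaintManager is false.

    For each Node in the eviction queue, it checks the Node's health and, if the Ready Condition is false or unknown, evicts the Pods on the Node

  • nc.monitorNodeHealth: monitors the health of every Node in the cluster, updates the Nodes' Conditions, caches each Node's state in the NodeLifecycle Controller, and adds abnormal Nodes to the eviction queue. This goroutine is a periodic task that runs in a loop at the interval set by nodeMonitorPeriod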

    // Run starts an asynchronous loop that monitors the status of cluster nodes.
    func (nc *Controller) Run(ctx context.Context) {
        defer utilruntime.HandleCrash()

        // Start events processing pipeline.
        nc.broadcaster.StartStructuredLogging(0)
        klog.Infof("Sending events to api server.")
        nc.broadcaster.StartRecordingToSink(
            &v1core.EventSinkImpl{
                Interface: v1core.New(nc.kubeClient.CoreV1().RESTClient()).Events(""),
            })
        defer nc.broadcaster.Shutdown()

        // Close node update queue to cleanup go routine.
        defer nc.nodeUpdateQueue.ShutDown()
        defer nc.podUpdateQueue.ShutDown()

        klog.Infof("Starting node controller")
        defer klog.Infof("Shutting down node controller")

        if !cache.WaitForNamedCacheSync("taint", ctx.Done(), nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
            return
        }

        if nc.runTaintManager {
            go nc.taintManager.Run(ctx)
        }

        // Start workers to reconcile labels and/or update NoSchedule taint for nodes.
        for i := 0; i < scheduler.UpdateWorkerSize; i++ {
            // Thanks to "workqueue", each worker just need to get item from queue, because
            // the item is flagged when got from queue: if new event come, the new item will
            // be re-queued until "Done", so no more than one worker handle the same item and
            // no event missed.
            go wait.UntilWithContext(ctx, nc.doNodeProcessingPassWorker, time.Second)
        }

        for i := 0; i < podUpdateWorkerSize; i++ {
            go wait.UntilWithContext(ctx, nc.doPodProcessingWorker, time.Second)
        }

        if nc.runTaintManager {
            // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
            // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
            go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
        } else {
            // Managing eviction of nodes:
            // When we delete pods off a node, if the node was not empty at the time we then
            // queue an eviction watcher. If we hit an error, retry deletion.
            go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
        }

        // Incorporate the results of node health signal pushed from kubelet to master.
        go wait.UntilWithContext(ctx, func(ctx context.Context) {
            if err := nc.monitorNodeHealth(ctx); err != nil {
                klog.Errorf("Error monitoring node health: %v", err)
            }
        }, nc.nodeMonitorPeriod)

        <-ctx.Done()
    }



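When runTaintManager is true, eviction proceeds in three steps:

  1. nc.monitorNodeHealth: on detecting that a Node's Ready Condition is false or unknown, adds the Node to the zoneNoExecuteTainter eviction queue
  2. nc.doNoExecuteTaintingPass: iterates over the zoneNoExecuteTainter eviction queue and, if a Node's Ready Condition is false or unknown, applies a NoExecute Taint to it
  3. nc.taintManager.Run: once it sees a NoExecute Taint on a Node, evicts every Pod that cannot tolerate that Taint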


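When runTaintManager is false, Pods are evicted directly:

  1. nc.monitorNodeHealth: on detecting that a Node's Ready Condition is false or unknown, adds the Node to the zonePodEvictor eviction queue
  2. nc.doPodProcessingWorker: if the Ready Condition of the Node a Pod runs on is false or unknown, adds that Node to the zonePodEvictor eviction queue
  3. nc.doEvictionPass: iterates over zonePodEvictor and evicts the Pods on each Node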


The NodeLifecycle Controller uses nc.doNodeProcessingPassWorker to maintain several NoSchedule Taints on each Node according to the Node's Conditions. The core logic lives in doNoScheduleTaintingPass:

  1. nc.nodeLister.Get


  2. for _, condition := range node.Status.Conditions

    Adds Taints according to the Node's Conditions

    1. Ready Condition
      • false: adds the node.kubernetes.io/not-ready Taint
      • unknown: adds the node.kubernetes.io/unreachable Taint
    2. MemoryPressure Condition
      • true: adds the node.kubernetes.io/memory-pressure Taint
    3. DiskPressure Condition
      • true: adds the node.kubernetes.io/disk-pressure Taint
    4. PIDPressure Condition
      • true: adds the node.kubernetes.io/pid-pressure Taint
    5. NetworkUnavailable Condition
      • true: adds the node.kubernetes.io/network-unavailable Taint
  3. node.Spec.Unschedulable


  4. SwapNodeControllerTaint
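
The Condition-to-Taint mapping listed above can be sketched as a lookup table. This is a simplified illustration, not the controller's code: the condition type and the noScheduleTaints function are hypothetical, and the node.kubernetes.io/unschedulable key used for step 3 comes from upstream Kubernetes rather than from the list above.

```go
package main

import (
	"fmt"
	"sort"
)

// condition is a hypothetical stand-in for v1.NodeCondition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// taintKeyFor maps condition type and status to a NoSchedule taint key.
var taintKeyFor = map[string]map[string]string{
	"Ready": {
		"False":   "node.kubernetes.io/not-ready",
		"Unknown": "node.kubernetes.io/unreachable",
	},
	"MemoryPressure":     {"True": "node.kubernetes.io/memory-pressure"},
	"DiskPressure":       {"True": "node.kubernetes.io/disk-pressure"},
	"PIDPressure":        {"True": "node.kubernetes.io/pid-pressure"},
	"NetworkUnavailable": {"True": "node.kubernetes.io/network-unavailable"},
}

// noScheduleTaints returns the sorted taint keys a node should carry for
// the given conditions and unschedulable flag.
func noScheduleTaints(conds []condition, unschedulable bool) []string {
	var keys []string
	for _, c := range conds {
		if key, ok := taintKeyFor[c.Type][c.Status]; ok {
			keys = append(keys, key)
		}
	}
	if unschedulable {
		// Assumption from upstream behavior: cordoned nodes get this taint.
		keys = append(keys, "node.kubernetes.io/unschedulable")
	}
	sort.Strings(keys)
	return keys
}

func main() {
	conds := []condition{
		{Type: "Ready", Status: "Unknown"},
		{Type: "DiskPressure", Status: "True"},
		{Type: "MemoryPressure", Status: "False"},
	}
	fmt.Println(noScheduleTaints(conds, false))
}
```

The final step, SwapNodeControllerTaint, then reconciles the Node so that it carries the computed Taints and loses the ones that no longer apply.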




nc.monitorNodeHealth goes through the following steps:

  1. nc.nodeLister.List



  2. nc.tryUpdateNodeHealth


  3. nc.getPodsAssignedToNode


  4. nc.processTaintBaseEviction and nc.processNoTaintBaseEviction


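Evicting Pods from a Node with NoExecute Taints

If the Taint Manager is running, the nc.doNoExecuteTaintingPass goroutine adds a NoExecute Taint to each abnormal Node, and the nc.taintManager.Run goroutine then evicts the Pods that cannot tolerate that Taint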


nc.doNoExecuteTaintingPass goes through the following steps:

  1. zoneNoExecuteTainterKeys


  2. zoneNoExecuteTainterWorker.Try


    1. nc.nodeLister.Get


    2. GetNodeCondition


    3. switch condition.Status

      • Ready Condition is false: adds the node.kubernetes.io/not-ready Taint
      • Ready Condition is unknown: adds the node.kubernetes.io/unreachable Taint
      • Both Taints have the NoExecute effect
    4. SwapNodeControllerTaint
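
The switch on the Ready Condition and the subsequent taint swap can be sketched as follows. This is a simplified illustration on plain string slices. swapTaint and noExecuteTaintsFor are hypothetical names and do not reproduce the signature of the real SwapNodeControllerTaint helper.

```go
package main

import "fmt"

const (
	notReadyKey    = "node.kubernetes.io/not-ready"
	unreachableKey = "node.kubernetes.io/unreachable"
)

// swapTaint returns taints with remove dropped and add present exactly once.
func swapTaint(taints []string, add, remove string) []string {
	out := make([]string, 0, len(taints)+1)
	hasAdd := false
	for _, t := range taints {
		if t == remove {
			continue
		}
		if t == add {
			hasAdd = true
		}
		out = append(out, t)
	}
	if !hasAdd {
		out = append(out, add)
	}
	return out
}

// noExecuteTaintsFor picks the NoExecute taint matching the Ready status
// and removes the opposite one.
func noExecuteTaintsFor(readyStatus string, taints []string) []string {
	switch readyStatus {
	case "False":
		return swapTaint(taints, notReadyKey, unreachableKey)
	case "Unknown":
		return swapTaint(taints, unreachableKey, notReadyKey)
	default:
		// Ready is True: this sketch leaves the taints untouched.
		return taints
	}
}

func main() {
	// A node that was not-ready and has now stopped reporting entirely.
	fmt.Println(noExecuteTaintsFor("Unknown", []string{notReadyKey}))
}
```

Swapping rather than only adding keeps the two Taints mutually exclusive, so a Node never carries both not-ready and unreachable at once.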



The Taint Manager (tc) then handles each Node as follows:

  1. tc.nodeLister.Get


  2. getNoExecuteTaints

    Gets all NoExecute Taints on the Node

  3. tc.getPodsAssignedToNode


  4. tc.processPodOnNode
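
The Taint Manager steps above amount to filtering the Node's Taints down to the NoExecute ones and then checking each Pod's tolerations against them. Below is a minimal sketch under strong simplifying assumptions: the taint and pod types are hypothetical, tolerations match on key only, and real tolerations also carry an operator, value, effect and tolerationSeconds.

```go
package main

import "fmt"

// taint is a hypothetical, reduced form of a node taint.
type taint struct {
	Key    string
	Effect string
}

// pod is a hypothetical pod that tolerates taints by key only.
type pod struct {
	Name          string
	ToleratedKeys []string
}

// noExecuteTaints keeps only the taints whose effect is NoExecute.
func noExecuteTaints(taints []taint) []taint {
	var out []taint
	for _, t := range taints {
		if t.Effect == "NoExecute" {
			out = append(out, t)
		}
	}
	return out
}

// toleratesAll reports whether p tolerates every taint in taints.
func toleratesAll(p pod, taints []taint) bool {
	for _, t := range taints {
		tolerated := false
		for _, k := range p.ToleratedKeys {
			if k == t.Key {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// podsToEvict returns the names of pods that do not tolerate all of the
// node's NoExecute taints.
func podsToEvict(pods []pod, nodeTaints []taint) []string {
	active := noExecuteTaints(nodeTaints)
	var names []string
	for _, p := range pods {
		if !toleratesAll(p, active) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	nodeTaints := []taint{
		{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/disk-pressure", Effect: "NoSchedule"},
	}
	pods := []pod{
		{Name: "tolerant", ToleratedKeys: []string{"node.kubernetes.io/not-ready"}},
		{Name: "plain"},
	}
	fmt.Println(podsToEvict(pods, nodeTaints))
}
```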



If the Taint Manager is not running, the Pods on an abnormal Node are evicted directly

  1. zonePodEvictorKeys


  2. zonePodEvictionWorker.Try


    1. nc.nodeLister.Get


    2. nc.getPodsAssignedToNode


    3. controllerutil.DeletePods


    4. nc.nodeEvictionMap.setStatus
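
The direct-eviction path combines the eviction queue with the three statuses kept in nodeEvictionMap (unmarked, toBeEvicted, evicted). The sketch below shows how these fit together. The evictor type and its methods are hypothetical, it runs single-threaded without locks, and it ignores zones and rate limiting.

```go
package main

import "fmt"

type evictionStatus int

const (
	unmarked evictionStatus = iota // no eviction needed
	toBeEvicted                    // queued, not yet evicted
	evicted                        // eviction already performed
)

// evictor is a hypothetical, single-threaded model of the direct path.
type evictor struct {
	status map[string]evictionStatus // node name -> eviction status
	queue  []string                  // nodes waiting for eviction
	pods   map[string][]string       // node name -> pod names
}

func newEvictor(pods map[string][]string) *evictor {
	return &evictor{status: map[string]evictionStatus{}, pods: pods}
}

// markForEviction queues node if it is still unmarked and reports
// whether it was added.
func (e *evictor) markForEviction(node string) bool {
	if e.status[node] != unmarked {
		return false
	}
	e.status[node] = toBeEvicted
	e.queue = append(e.queue, node)
	return true
}

// evictionPass deletes the pods of every queued node, marks the node as
// evicted, and returns the names of the deleted pods.
func (e *evictor) evictionPass() []string {
	var deleted []string
	for _, node := range e.queue {
		if e.status[node] != toBeEvicted {
			continue
		}
		deleted = append(deleted, e.pods[node]...)
		delete(e.pods, node)
		e.status[node] = evicted
	}
	e.queue = nil
	return deleted
}

func main() {
	e := newEvictor(map[string][]string{"node-1": {"web-0", "web-1"}})
	e.markForEviction("node-1")
	fmt.Println(e.evictionPass(), e.status["node-1"] == evicted)
}
```

Tracking the status separately from the queue is what prevents a Node from being queued again while its eviction is pending or already done.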
