Kube-controller-manager (NodeLifecycleController)

Based on Kubernetes 1.25

In Kubernetes, a Pod is scheduled onto a Node and runs there. When a Node becomes unhealthy, the cluster needs to mark the Node as unhealthy and evict the Pods on it so that they can run on other Nodes.

  • The NodeLifecycle Controller watches Node status and adjusts the Node's Taints based on that status

The NodeLifecycle Controller has an --enable-taint-manager startup flag. It is enabled by default and was removed in 1.27:

  • If true: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods on the Node by adding NoExecute Taints to it (as sketched below)
  • If false: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods directly, without adding a Taint
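
For reference, the taint-based path revolves around two NoExecute Taints. The sketch below only constructs them with the k8s.io/api/core/v1 types; the variable names are ours, not from the controller source:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    func main() {
        // Applied when the Node's Ready Condition is False.
        notReady := v1.Taint{Key: "node.kubernetes.io/not-ready", Effect: v1.TaintEffectNoExecute}
        // Applied when the Node's Ready Condition is Unknown.
        unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}
        fmt.Printf("%v\n%v\n", notReady, unreachable)
    }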

Controller initialization

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L261


    // Controller is the controller that manages node's life cycle.
    type Controller struct {
        taintManager *scheduler.NoExecuteTaintManager

        podLister corelisters.PodLister
        podInformerSynced cache.InformerSynced
        kubeClient clientset.Interface

        // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
        // to avoid the problem with time skew across the cluster.
        now func() metav1.Time

        enterPartialDisruptionFunc func(nodeNum int) float32
        enterFullDisruptionFunc func(nodeNum int) float32
        computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
        // Caches all Nodes in the cluster, keyed by Node name.
        knownNodeSet map[string]*v1.Node
        // per Node map storing last observed health together with a local time when it was observed.
        // Caches the health of every Node in the cluster. This is a lock-protected map keyed by
        // Node name; the value records the Node's last observed health and when it was observed.
        nodeHealthMap *nodeHealthMap

        // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
        evictorLock sync.Mutex
        // Caches the Pod eviction state of the Nodes in the cluster. This is a lock-protected map;
        // the key is the Node name and the value records the Node's Pod eviction state.
        // There are three eviction states:
        // 1. unmarked: no eviction needed
        // 2. toBeEvicted: eviction needed but not yet performed
        // 3. evicted: Pod eviction has been performed
        // This field is only used when runTaintManager is false.
        nodeEvictionMap *nodeEvictionMap
        // workers that evicts pods from unresponsive nodes.
        // Per-zone eviction queues recording the Nodes whose Pods need to be evicted.
        zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
        // workers that are responsible for tainting nodes.
        zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

        nodesToRetry sync.Map

        zoneStates map[string]ZoneState

        daemonSetStore appsv1listers.DaemonSetLister
        daemonSetInformerSynced cache.InformerSynced

        leaseLister coordlisters.LeaseLister
        leaseInformerSynced cache.InformerSynced
        nodeLister corelisters.NodeLister
        nodeInformerSynced cache.InformerSynced

        getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)

        broadcaster record.EventBroadcaster
        recorder record.EventRecorder

        // Value controlling Controller monitoring period, i.e. how often does Controller
        // check node health signal posted from kubelet. This value should be lower than
        // nodeMonitorGracePeriod.
        // TODO: Change node health monitor to watch based.
        // How often the NodeLifecycle Controller checks the node health signals reported by the kubelet.
        nodeMonitorPeriod time.Duration

        // When node is just created, e.g. cluster bootstrap or node creation, we give
        // a longer grace period.
        nodeStartupGracePeriod time.Duration

        // Controller will not proactively sync node health, but will monitor node
        // health signal updated from kubelet. There are 2 kinds of node healthiness
        // signals: NodeStatus and NodeLease. If it doesn't receive update for this amount
        // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
        // time before which Controller start evicting pods is controlled via flag
        // 'pod-eviction-timeout'.
        // Note: be cautious when changing the constant, it must work with
        // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
        // controller. The node health signal update frequency is the minimal of the
        // two.
        // There are several constraints:
        // 1. nodeMonitorGracePeriod must be N times more than the node health signal
        // update frequency, where N means number of retries allowed for kubelet to
        // post node status/lease. It is pointless to make nodeMonitorGracePeriod
        // be less than the node health signal update frequency, since there will
        // only be fresh values from Kubelet at an interval of node health signal
        // update frequency. The constant must be less than podEvictionTimeout.
        // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
        // value takes longer for user to see up-to-date node health.
        nodeMonitorGracePeriod time.Duration

        // How long to wait before evicting Pods from an unhealthy Node.
        podEvictionTimeout time.Duration
        evictionLimiterQPS float32
        secondaryEvictionLimiterQPS float32
        largeClusterThreshold int32
        unhealthyZoneThreshold float32

        // if set to true Controller will start TaintManager that will evict Pods from
        // tainted nodes, if they're not tolerated.
        // Corresponds to the --enable-taint-manager startup flag described above.
        runTaintManager bool
        // Work queues holding the node and pod events picked up from the informers.
        nodeUpdateQueue workqueue.Interface
        podUpdateQueue workqueue.RateLimitingInterface
    }

Main execution flow

The NodeLifecycle Controller's startup mainly launches six kinds of goroutines:

  • nc.taintManager.Run: started only when runTaintManager is true. When NoExecute Taints appear on a Node, it evicts every Pod on that Node that cannot tolerate them

  • nc.doNodeProcessingPassWorker: when a Node's Conditions report a problem, it adds NoSchedule Taints to the Node so that no new Pods are scheduled onto it

  • nc.doPodProcessingWorker: checks the status of the Node a Pod is bound to. When the Node's Ready Condition is false or unknown, it marks the Pod's Ready Condition as false; if runTaintManager is false, it also sends the Node down the direct-eviction path

  • nc.doNoExecuteTaintingPass: started only when runTaintManager is true.

    For each Node in the eviction queue, based on the Node's health, it adds the corresponding NoExecute Taint when the Node's Ready Condition is false or unknown

  • nc.doEvictionPass: started only when runTaintManager is false.

    For each Node in the eviction queue, based on the Node's health, it evicts the Pods on the Node when the Node's Ready Condition is false or unknown

  • nc.monitorNodeHealth: watches the health of every Node in the cluster, updates the Nodes' Conditions, caches the Node state inside the NodeLifecycle Controller, and adds unhealthy Nodes to the eviction queue. This goroutine is a periodic loop whose interval is determined by nodeMonitorPeriod

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L527

    // Run starts an asynchronous loop that monitors the status of cluster nodes.
    func (nc *Controller) Run(ctx context.Context) {
        defer utilruntime.HandleCrash()

        // Start events processing pipeline.
        nc.broadcaster.StartStructuredLogging(0)
        klog.Infof("Sending events to api server.")
        nc.broadcaster.StartRecordingToSink(
            &v1core.EventSinkImpl{
                Interface: v1core.New(nc.kubeClient.CoreV1().RESTClient()).Events(""),
            })
        defer nc.broadcaster.Shutdown()

        // Close node update queue to cleanup go routine.
        defer nc.nodeUpdateQueue.ShutDown()
        defer nc.podUpdateQueue.ShutDown()

        klog.Infof("Starting node controller")
        defer klog.Infof("Shutting down node controller")

        if !cache.WaitForNamedCacheSync("taint", ctx.Done(), nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
            return
        }

        if nc.runTaintManager {
            go nc.taintManager.Run(ctx)
        }

        // Start workers to reconcile labels and/or update NoSchedule taint for nodes.
        for i := 0; i < scheduler.UpdateWorkerSize; i++ {
            // Thanks to "workqueue", each worker just need to get item from queue, because
            // the item is flagged when got from queue: if new event come, the new item will
            // be re-queued until "Done", so no more than one worker handle the same item and
            // no event missed.
            go wait.UntilWithContext(ctx, nc.doNodeProcessingPassWorker, time.Second)
        }

        for i := 0; i < podUpdateWorkerSize; i++ {
            go wait.UntilWithContext(ctx, nc.doPodProcessingWorker, time.Second)
        }

        if nc.runTaintManager {
            // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
            // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
            go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
        } else {
            // Managing eviction of nodes:
            // When we delete pods off a node, if the node was not empty at the time we then
            // queue an eviction watcher. If we hit an error, retry deletion.
            go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
        }

        // Incorporate the results of node health signal pushed from kubelet to master.
        go wait.UntilWithContext(ctx, func(ctx context.Context) {
            if err := nc.monitorNodeHealth(ctx); err != nil {
                klog.Errorf("Error monitoring node health: %v", err)
            }
        }, nc.nodeMonitorPeriod)

        <-ctx.Done()
    }

When runTaintManager is true, the Pod eviction flow after a Node becomes unhealthy is as follows:

  1. nc.monitorNodeHealth: detects that the Node's Ready Condition is false or unknown and adds the Node to the zoneNoExecuteTainter eviction queue
  2. nc.doNoExecuteTaintingPass: iterates over the zoneNoExecuteTainter eviction queue and, when a Node's Ready Condition is false or unknown, applies the NoExecute Taint
  3. nc.taintManager.Run: observes the NoExecute Taint on the Node and evicts every Pod that cannot tolerate it (see the toleration sketch below)
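
Whether, and how quickly, a Pod is evicted in step 3 depends on its tolerations. The sketch below builds a toleration that lets a Pod survive the node.kubernetes.io/unreachable NoExecute Taint for 60 seconds before the Taint Manager evicts it; it is an illustrative fragment using the k8s.io/api types, not controller code:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    func main() {
        sixty := int64(60)
        toleration := v1.Toleration{
            // Tolerate the unreachable NoExecute Taint for 60s: the Taint Manager waits
            // TolerationSeconds before evicting the Pod from the tainted Node.
            Key:               "node.kubernetes.io/unreachable",
            Operator:          v1.TolerationOpExists,
            Effect:            v1.TaintEffectNoExecute,
            TolerationSeconds: &sixty,
        }
        fmt.Printf("%+v\n", toleration)
    }

By default the DefaultTolerationSeconds admission plugin adds tolerations of this shape for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with a 300-second window, which is why Pods typically stay on an unreachable Node for about five minutes before being evicted.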

When runTaintManager is false, the following happens after a Node becomes unhealthy:

  1. nc.monitorNodeHealth: detects that the Node's Ready Condition is false or unknown and adds the Node to the zonePodEvictor eviction queue
  2. nc.doPodProcessingWorker: when the Node a Pod is bound to has a Ready Condition of false or unknown, also adds that Node to the zonePodEvictor eviction queue
  3. nc.doEvictionPass: iterates over zonePodEvictor and evicts the Pods on each queued Node (its per-Node bookkeeping is sketched below)
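
The per-Node bookkeeping for this path lives in nodeEvictionMap (see the struct above). The sketch below models the three eviction states with simplified types of our own to make the transitions explicit; it is not the controller's implementation:

    package main

    import (
        "fmt"
        "sync"
    )

    // evictionStatus mirrors the three per-Node states: unmarked -> toBeEvicted -> evicted.
    type evictionStatus int

    const (
        unmarked    evictionStatus = iota // no eviction needed
        toBeEvicted                       // queued for eviction, not yet performed
        evicted                           // Pod eviction has been performed
    )

    // evictionMap is a simplified lock-protected map from Node name to eviction state.
    type evictionMap struct {
        sync.Mutex
        states map[string]evictionStatus
    }

    func (m *evictionMap) setStatus(nodeName string, s evictionStatus) {
        m.Lock()
        defer m.Unlock()
        m.states[nodeName] = s
    }

    func main() {
        m := &evictionMap{states: map[string]evictionStatus{}}
        m.setStatus("node-1", toBeEvicted) // monitorNodeHealth queues the Node
        m.setStatus("node-1", evicted)     // doEvictionPass finishes deleting its Pods
        fmt.Println(m.states["node-1"])
    }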

Adding NoSchedule Taints

The NodeLifecycle Controller maintains the NoSchedule-effect Taints on each Node according to the Node's Conditions, mainly through nc.doNodeProcessingPassWorker; the core logic is in doNoScheduleTaintingPass (a sketch of the condition-to-taint mapping follows the list):

  1. nc.nodeLister.Get

    Get the Node object

  2. for _, condition := range node.Status.Conditions

    Add Taints according to the Node's Conditions:

    1. Ready Condition
      • false: add the node.kubernetes.io/not-ready Taint
      • unknown: add the node.kubernetes.io/unreachable Taint
    2. MemoryPressure Condition
      • true: add the node.kubernetes.io/memory-pressure Taint
    3. DiskPressure Condition
      • true: add the node.kubernetes.io/disk-pressure Taint
    4. PIDPressure Condition
      • true: add the node.kubernetes.io/pid-pressure Taint
    5. NetworkUnavailable Condition
      • true: add the node.kubernetes.io/network-unavailable Taint
  3. node.Spec.Unschedulable

    Add the unschedulable Taint according to node.Spec.Unschedulable

  4. SwapNodeControllerTaint

    Update the Taints on the Node
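
Conceptually, doNoScheduleTaintingPass applies a condition-to-taint-key mapping like the one below. This is our own reconstruction for illustration; the taint keys are the standard ones listed above and every taint here carries the NoSchedule effect:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // nodeConditionToTaintKey maps an abnormal Node condition (type + status)
    // to the NoSchedule taint key that should be present on the Node.
    var nodeConditionToTaintKey = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
        v1.NodeReady: {
            v1.ConditionFalse:   "node.kubernetes.io/not-ready",
            v1.ConditionUnknown: "node.kubernetes.io/unreachable",
        },
        v1.NodeMemoryPressure:     {v1.ConditionTrue: "node.kubernetes.io/memory-pressure"},
        v1.NodeDiskPressure:       {v1.ConditionTrue: "node.kubernetes.io/disk-pressure"},
        v1.NodePIDPressure:        {v1.ConditionTrue: "node.kubernetes.io/pid-pressure"},
        v1.NodeNetworkUnavailable: {v1.ConditionTrue: "node.kubernetes.io/network-unavailable"},
    }

    // desiredNoScheduleTaints computes the NoSchedule taints a Node should carry
    // given its current Conditions and spec.unschedulable.
    func desiredNoScheduleTaints(node *v1.Node) []v1.Taint {
        var taints []v1.Taint
        for _, cond := range node.Status.Conditions {
            if key, ok := nodeConditionToTaintKey[cond.Type][cond.Status]; ok {
                taints = append(taints, v1.Taint{Key: key, Effect: v1.TaintEffectNoSchedule})
            }
        }
        if node.Spec.Unschedulable {
            taints = append(taints, v1.Taint{Key: "node.kubernetes.io/unschedulable", Effect: v1.TaintEffectNoSchedule})
        }
        return taints
    }

    func main() {
        node := &v1.Node{
            Spec: v1.NodeSpec{Unschedulable: true},
            Status: v1.NodeStatus{Conditions: []v1.NodeCondition{
                {Type: v1.NodeReady, Status: v1.ConditionUnknown},
            }},
        }
        fmt.Println(desiredNoScheduleTaints(node))
    }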

Node health checking

nc.monitorNodeHealth adds unhealthy Nodes to the eviction queue based on each Node's status (a sketch of the staleness check follows the list):

  1. nc.nodeLister.List

    List all Nodes in the cluster

    This runs periodically, checking every Node in the cluster on each pass

  2. nc.tryUpdateNodeHealth

    Check and update the Node's latest Conditions

  3. nc.getPodsAssignedToNode

    Get the Pods on the Node

  4. nc.processTaintBaseEviction / nc.processNoTaintBaseEviction

    Based on the Node's state, add it to the appropriate eviction queue
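
The core of tryUpdateNodeHealth is a staleness check: if neither a NodeStatus update nor a Lease renewal has been observed within nodeMonitorGracePeriod, the Ready Condition is set to Unknown. The sketch below captures just that timing rule; the function and variable names are ours, not the controller's:

    package main

    import (
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // readyStale reports whether the Node's health signal is stale, i.e. the last
    // observed heartbeat is older than the grace period. When this returns true,
    // the controller sets the Ready Condition to Unknown.
    func readyStale(lastHeartbeat metav1.Time, gracePeriod time.Duration, now metav1.Time) bool {
        return now.Time.After(lastHeartbeat.Time.Add(gracePeriod))
    }

    func main() {
        now := metav1.Now()
        lastHeartbeat := metav1.NewTime(now.Time.Add(-50 * time.Second))
        // With the default nodeMonitorGracePeriod of 40s, a 50s-old heartbeat is stale.
        fmt.Println(readyStale(lastHeartbeat, 40*time.Second, now))
    }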

Evicting Pods from a Node with NoExecute Taints

If the Taint Manager is running, the nc.doNoExecuteTaintingPass goroutine adds NoExecute Taints to unhealthy Nodes, and the nc.taintManager.Run goroutine then evicts the Pods that cannot tolerate those Taints.

nc.doNoExecuteTaintingPass

  1. zoneNoExecuteTainterKeys

    Get the names of all zones

  2. zoneNoExecuteTainterWorker.Try

    Add Taints to each Node in each zone

    1. nc.nodeLister.Get

      Get the Node object

    2. GetNodeCondition

      Get the Node's current Ready Condition

    3. switch condition.Status

      • Ready Condition false: add the node.kubernetes.io/not-ready Taint
      • Ready Condition unknown: add the node.kubernetes.io/unreachable Taint
      • Both Taints have the NoExecute effect
    4. SwapNodeControllerTaint

      Call kube-apiserver to update the Taints (sketched below)
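
The end result of SwapNodeControllerTaint is that the Node object carries the NoExecute Taint matching its current Ready Condition. A simplified client-go sketch of that update is below; it fetches the Node, appends the desired Taint if missing, and writes the Node back. The function name is ours; removing the opposite Taint and handling update conflicts are omitted:

    package main

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // ensureNoExecuteTaint adds the given NoExecute taint to the Node if it is not
    // already present, then updates the Node through the API server.
    func ensureNoExecuteTaint(ctx context.Context, client kubernetes.Interface, nodeName string, taint v1.Taint) error {
        node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, t := range node.Spec.Taints {
            if t.Key == taint.Key && t.Effect == taint.Effect {
                return nil // already tainted
            }
        }
        node.Spec.Taints = append(node.Spec.Taints, taint)
        _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }

    func main() {
        // Running this requires a live clientset (e.g. built from kubeconfig);
        // shown only to make the intended call shape explicit.
        _ = ensureNoExecuteTaint
    }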

nc.taintManager.Run

  1. tc.nodeLister.Get

    Get the Node object

  2. getNoExecuteTaints

    Get all NoExecute Taints on the Node

  3. tc.getPodsAssignedToNode

    Get the Pods on the Node

  4. tc.processPodOnNode

    Evict the Pods that cannot tolerate the Taints (a toleration-check sketch follows)
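
Step 4 boils down to checking whether the Pod tolerates every NoExecute Taint on the Node; if not, the Pod is deleted, possibly after a tolerationSeconds delay. A minimal check using the Toleration.ToleratesTaint helper from k8s.io/api is sketched below; the surrounding function is ours:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // toleratesAll reports whether the Pod tolerates every NoExecute taint on the
    // Node; Pods for which this returns false are candidates for eviction.
    func toleratesAll(pod *v1.Pod, noExecuteTaints []v1.Taint) bool {
        for i := range noExecuteTaints {
            tolerated := false
            for _, tol := range pod.Spec.Tolerations {
                if tol.ToleratesTaint(&noExecuteTaints[i]) {
                    tolerated = true
                    break
                }
            }
            if !tolerated {
                return false
            }
        }
        return true
    }

    func main() {
        unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}
        pod := &v1.Pod{} // a Pod with no tolerations at all
        fmt.Println(toleratesAll(pod, []v1.Taint{unreachable})) // false -> evict
    }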

Evicting Pods from a Node directly

If the Taint Manager is not running, the Pods on an unhealthy Node are evicted directly by nc.doEvictionPass:

  1. zonePodEvictorKeys

    Get the names of all zones

  2. zonePodEvictionWorker.Try

    Evict the Pods on each Node in each zone (see the deletion sketch after this list):

    1. nc.nodeLister.Get

      Get the Node

    2. nc.getPodsAssignedToNode

      Get the Pods on the Node

    3. controllerutil.DeletePods

      Evict (delete) the Pods

    4. nc.nodeEvictionMap.setStatus

      Set the Node's eviction state to evicted
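
The effect of step 3 is to delete the Pods bound to the Node through the API server (the real controllerutil.DeletePods additionally skips DaemonSet-managed Pods and Pods that are already terminating, and records eviction events). A simplified client-go sketch of that deletion step is below; the function name is ours and retries are omitted:

    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // deletePodsOnNode deletes every Pod bound to nodeName, which is what the
    // direct-eviction path ultimately does for an unhealthy Node.
    func deletePodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
        pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            return err
        }
        for _, pod := range pods.Items {
            if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        // Running this requires a live clientset; shown only for the call shape.
        _ = deletePodsOnNode
    }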