Kube-controller-manager (NodeLifecycleController)
Based on Kubernetes 1.25
K8s schedules Pods onto Nodes to run. When a Node becomes unhealthy, the cluster needs to mark the Node as abnormal and evict the Pods on it so that they can run on other Nodes.
- The NodeLifecycle Controller is responsible for watching Node status and adjusting the Node's Taints according to that status.
The NodeLifecycle Controller has an --enable-taint-manager startup flag, which is enabled by default and was removed in 1.27:
- If true: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods on the Node by adding a NoExecute Taint to it.
- If false: when a Node becomes unhealthy, the NodeLifecycle Controller evicts the Pods directly, without adding a Taint.
Controller initialization
// Controller is the controller that manages node's life cycle.
type Controller struct {
    taintManager *scheduler.NoExecuteTaintManager
    podLister corelisters.PodLister
    podInformerSynced cache.InformerSynced
    kubeClient clientset.Interface
    // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
    // to avoid the problem with time skew across the cluster.
    now func() metav1.Time
    enterPartialDisruptionFunc func(nodeNum int) float32
    enterFullDisruptionFunc func(nodeNum int) float32
    computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
    // Caches all Nodes in the cluster, keyed by Node name.
    knownNodeSet map[string]*v1.Node
    // per Node map storing last observed health together with a local time when it was observed.
    // This is a lock-protected map keyed by node name; the value records the Node's last observed health.
    nodeHealthMap *nodeHealthMap
    // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
    evictorLock sync.Mutex
    // Caches the pod-eviction status of each Node in the cluster. This is a lock-protected map
    // keyed by Node name; the value is the Node's pod-eviction status.
    // There are three eviction states:
    // 1. unmarked: no eviction is needed
    // 2. toBeEvicted: eviction is needed but has not been performed yet
    // 3. evicted: pod eviction has already been performed
    // This field is only used when runTaintManager is false.
    nodeEvictionMap *nodeEvictionMap
    // workers that evicts pods from unresponsive nodes.
    // Per-zone eviction queues recording the Nodes whose Pods need to be evicted.
    zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
    // workers that are responsible for tainting nodes.
    zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue
    nodesToRetry sync.Map
    zoneStates map[string]ZoneState
    daemonSetStore appsv1listers.DaemonSetLister
    daemonSetInformerSynced cache.InformerSynced
    leaseLister coordlisters.LeaseLister
    leaseInformerSynced cache.InformerSynced
    nodeLister corelisters.NodeLister
    nodeInformerSynced cache.InformerSynced
    getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)
    broadcaster record.EventBroadcaster
    recorder record.EventRecorder
    // Value controlling Controller monitoring period, i.e. how often does Controller
    // check node health signal posted from kubelet. This value should be lower than
    // nodeMonitorGracePeriod.
    // TODO: Change node health monitor to watch based.
    // Interval at which the Controller checks the node health signal reported by the kubelet;
    // nodeMonitorGracePeriod below is the timeout after which the Node's Condition is set to Unknown.
    nodeMonitorPeriod time.Duration
    // When node is just created, e.g. cluster bootstrap or node creation, we give
    // a longer grace period.
    nodeStartupGracePeriod time.Duration
    // Controller will not proactively sync node health, but will monitor node
    // health signal updated from kubelet. There are 2 kinds of node healthiness
    // signals: NodeStatus and NodeLease. If it doesn't receive update for this amount
    // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
    // time before which Controller start evicting pods is controlled via flag
    // 'pod-eviction-timeout'.
    // Note: be cautious when changing the constant, it must work with
    // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
    // controller. The node health signal update frequency is the minimal of the
    // two.
    // There are several constraints:
    // 1. nodeMonitorGracePeriod must be N times more than the node health signal
    //    update frequency, where N means number of retries allowed for kubelet to
    //    post node status/lease. It is pointless to make nodeMonitorGracePeriod
    //    be less than the node health signal update frequency, since there will
    //    only be fresh values from Kubelet at an interval of node health signal
    //    update frequency. The constant must be less than podEvictionTimeout.
    // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
    //    value takes longer for user to see up-to-date node health.
    nodeMonitorGracePeriod time.Duration
    // How long to wait before evicting Pods from an unhealthy Node.
    podEvictionTimeout time.Duration
    evictionLimiterQPS float32
    secondaryEvictionLimiterQPS float32
    largeClusterThreshold int32
    unhealthyZoneThreshold float32
    // if set to true Controller will start TaintManager that will evict Pods from
    // tainted nodes, if they're not tolerated.
    // Corresponds to the --enable-taint-manager startup flag described above.
    runTaintManager bool
    // Work queues for node and pod events observed from the informers.
    nodeUpdateQueue workqueue.Interface
    podUpdateQueue workqueue.RateLimitingInterface
}
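The enterPartialDisruptionFunc and enterFullDisruptionFunc fields decide how fast the controller may evict once computeZoneStateFunc has classified a zone as partially or fully disrupted. The sketch below only illustrates the idea and is not the controller's code; evictionRates, partialDisruptionQPS, and fullDisruptionQPS are hypothetical names, wired from the same values that populate evictionLimiterQPS, secondaryEvictionLimiterQPS, and largeClusterThreshold.

package sketch

// evictionRates is an illustrative container for the rate-limit settings; it mirrors the
// corresponding Controller fields above but is not part of the controller.
type evictionRates struct {
    evictionLimiterQPS          float32 // normal eviction rate
    secondaryEvictionLimiterQPS float32 // reduced rate used during a partial zone disruption
    largeClusterThreshold       int32   // small clusters stop evicting during partial disruption
}

// partialDisruptionQPS sketches enterPartialDisruptionFunc: large clusters keep evicting at a
// reduced rate when only part of a zone is unhealthy, while small clusters stop evicting.
func (r evictionRates) partialDisruptionQPS(nodeNum int) float32 {
    if int32(nodeNum) > r.largeClusterThreshold {
        return r.secondaryEvictionLimiterQPS
    }
    return 0
}

// fullDisruptionQPS sketches enterFullDisruptionFunc, which falls back to the normal
// eviction rate once a whole zone is considered disrupted.
func (r evictionRates) fullDisruptionQPS(nodeNum int) float32 {
    return r.evictionLimiterQPS
}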
Main execution logic
The NodeLifecycle Controller startup mainly launches six kinds of goroutines:
- nc.taintManager.Run: started only when runTaintManager is true. When NoExecute Taints appear on a Node, it evicts all Pods on that Node that cannot tolerate them.
- nc.doNodeProcessingPassWorker: when a Node's Conditions become abnormal, adds NoSchedule Taints to the Node to keep new Pods from being scheduled onto it.
- nc.doPodProcessingWorker: checks the status of the Node a Pod runs on. When the Node's Ready Condition is false or unknown: if runTaintManager is false, the Node is added to the eviction queue; if it is true, the Pod's Ready Condition is set to false.
- nc.doNoExecuteTaintingPass: started only when runTaintManager is true. For Nodes in the eviction queue, adds the corresponding NoExecute Taint when the Node's Ready Condition is false or unknown.
- nc.doEvictionPass: started only when runTaintManager is false. For Nodes in the eviction queue, evicts the Pods on the Node when its Ready Condition is false or unknown.
- nc.monitorNodeHealth: watches the health of every Node in the cluster, updates the Nodes' Conditions, caches the Node state inside the NodeLifecycle Controller, and adds unhealthy Nodes to the eviction queue. It runs as a periodic task whose interval is controlled by nodeMonitorPeriod.
// Run starts an asynchronous loop that monitors the status of cluster nodes.
func (nc *Controller) Run(ctx context.Context) {
    defer utilruntime.HandleCrash()
    // Start events processing pipeline.
    nc.broadcaster.StartStructuredLogging(0)
    klog.Infof("Sending events to api server.")
    nc.broadcaster.StartRecordingToSink(
        &v1core.EventSinkImpl{
            Interface: v1core.New(nc.kubeClient.CoreV1().RESTClient()).Events(""),
        })
    defer nc.broadcaster.Shutdown()
    // Close node update queue to cleanup go routine.
    defer nc.nodeUpdateQueue.ShutDown()
    defer nc.podUpdateQueue.ShutDown()
    klog.Infof("Starting node controller")
    defer klog.Infof("Shutting down node controller")
    if !cache.WaitForNamedCacheSync("taint", ctx.Done(), nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
        return
    }
    if nc.runTaintManager {
        go nc.taintManager.Run(ctx)
    }
    // Start workers to reconcile labels and/or update NoSchedule taint for nodes.
    for i := 0; i < scheduler.UpdateWorkerSize; i++ {
        // Thanks to "workqueue", each worker just need to get item from queue, because
        // the item is flagged when got from queue: if new event come, the new item will
        // be re-queued until "Done", so no more than one worker handle the same item and
        // no event missed.
        go wait.UntilWithContext(ctx, nc.doNodeProcessingPassWorker, time.Second)
    }
    for i := 0; i < podUpdateWorkerSize; i++ {
        go wait.UntilWithContext(ctx, nc.doPodProcessingWorker, time.Second)
    }
    if nc.runTaintManager {
        // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
        // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
        go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
    } else {
        // Managing eviction of nodes:
        // When we delete pods off a node, if the node was not empty at the time we then
        // queue an eviction watcher. If we hit an error, retry deletion.
        go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
    }
    // Incorporate the results of node health signal pushed from kubelet to master.
    go wait.UntilWithContext(ctx, func(ctx context.Context) {
        if err := nc.monitorNodeHealth(ctx); err != nil {
            klog.Errorf("Error monitoring node health: %v", err)
        }
    }, nc.nodeMonitorPeriod)
    <-ctx.Done()
}
When runTaintManager is true, the Pod eviction flow after a Node becomes unhealthy is:
- nc.monitorNodeHealth: detects that the Node's Ready Condition is false or unknown and adds the Node to the zoneNoExecuteTainter eviction queue.
- nc.doNoExecuteTaintingPass: iterates over the zoneNoExecuteTainter queue and adds the corresponding NoExecute Taint to Nodes whose Ready Condition is false or unknown.
- nc.taintManager.Run: observes the NoExecute Taint on the Node and evicts every Pod that cannot tolerate it.
When runTaintManager is false, the following happens after a Node becomes unhealthy:
- nc.monitorNodeHealth: if it detects that the Node's Ready Condition is false or unknown, it adds the Node to the zonePodEvictor eviction queue.
- nc.doPodProcessingWorker: if the Node of a Pod it processes has a Ready Condition of false or unknown, it also adds that Node to the zonePodEvictor eviction queue.
- nc.doEvictionPass: iterates over zonePodEvictor and evicts the Pods on each queued Node (a simplified sketch of this fork follows below).
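The fork between the two modes can be summarized in a small sketch. This is not the controller's actual code: handleUnhealthyNode and the two enqueue callbacks are hypothetical stand-ins for adding a Node to zoneNoExecuteTainter or zonePodEvictor after monitorNodeHealth has computed the Ready Condition.

package sketch

import (
    v1 "k8s.io/api/core/v1"
)

// handleUnhealthyNode is a hypothetical helper showing how an unhealthy Node is routed:
// with the taint manager enabled the Node goes to the NoExecute-tainting queue
// (zoneNoExecuteTainter), otherwise to the direct pod-eviction queue (zonePodEvictor).
func handleUnhealthyNode(node *v1.Node, ready *v1.NodeCondition, runTaintManager bool,
    enqueueForTainting, enqueueForPodEviction func(*v1.Node)) {
    if ready == nil || ready.Status == v1.ConditionTrue {
        return // the Node is healthy, nothing to do
    }
    // Ready Condition is False or Unknown from here on.
    if runTaintManager {
        // doNoExecuteTaintingPass later adds node.kubernetes.io/not-ready or
        // node.kubernetes.io/unreachable with effect NoExecute.
        enqueueForTainting(node)
    } else {
        // doEvictionPass later deletes the Pods on the Node directly.
        enqueueForPodEviction(node)
    }
}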
Adding NoSchedule Taints
The NodeLifecycle Controller maintains several NoSchedule-effect Taints on each Node according to the Node's Conditions. This is done by nc.doNodeProcessingPassWorker, whose core logic is in doNoScheduleTaintingPass:
- nc.nodeLister.Get: fetches the Node object.
- for _, condition := range node.Status.Conditions: adds Taints based on the Node's Conditions:
  - Ready Condition
    - false: add the node.kubernetes.io/not-ready Taint
    - unknown: add the node.kubernetes.io/unreachable Taint
  - MemoryPressure Condition
    - true: add the node.kubernetes.io/memory-pressure Taint
  - DiskPressure Condition
    - true: add the node.kubernetes.io/disk-pressure Taint
  - PIDPressure Condition
    - true: add the node.kubernetes.io/pid-pressure Taint
  - NetworkUnavailable Condition
    - true: add the node.kubernetes.io/network-unavailable Taint
- node.Spec.Unschedulable: when the Node is marked unschedulable, add the node.kubernetes.io/unschedulable Taint.
- SwapNodeControllerTaint: updates the Taints on the Node.
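The Condition-to-Taint rules above are essentially a lookup table (the controller keeps a similar table internally). The sketch below is a simplified illustration rather than the controller's code; buildNoScheduleTaints is a hypothetical helper.

package sketch

import v1 "k8s.io/api/core/v1"

// conditionToTaintKey maps a Node Condition type and status to the NoSchedule taint key
// that should be present on the Node.
var conditionToTaintKey = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
    v1.NodeReady: {
        v1.ConditionFalse:   "node.kubernetes.io/not-ready",
        v1.ConditionUnknown: "node.kubernetes.io/unreachable",
    },
    v1.NodeMemoryPressure:     {v1.ConditionTrue: "node.kubernetes.io/memory-pressure"},
    v1.NodeDiskPressure:       {v1.ConditionTrue: "node.kubernetes.io/disk-pressure"},
    v1.NodePIDPressure:        {v1.ConditionTrue: "node.kubernetes.io/pid-pressure"},
    v1.NodeNetworkUnavailable: {v1.ConditionTrue: "node.kubernetes.io/network-unavailable"},
}

// buildNoScheduleTaints (hypothetical helper) computes the set of NoSchedule taints a Node
// should carry, given its current Conditions and Unschedulable flag.
func buildNoScheduleTaints(node *v1.Node) []v1.Taint {
    var taints []v1.Taint
    for _, cond := range node.Status.Conditions {
        if key, ok := conditionToTaintKey[cond.Type][cond.Status]; ok {
            taints = append(taints, v1.Taint{Key: key, Effect: v1.TaintEffectNoSchedule})
        }
    }
    if node.Spec.Unschedulable {
        taints = append(taints, v1.Taint{Key: "node.kubernetes.io/unschedulable", Effect: v1.TaintEffectNoSchedule})
    }
    return taints
}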
Node health detection
nc.monitorNodeHealth adds unhealthy Nodes to the eviction queue based on their status. It runs as a periodic loop and checks every Node in the cluster on each pass:
- nc.nodeLister.List: lists all Nodes in the cluster.
- nc.tryUpdateNodeHealth: checks and updates the Node's latest Conditions.
- nc.getPodsAssignedToNode: fetches the Pods on the Node.
- nc.processTaintBaseEviction / nc.processNoTaintBaseEviction: decides from the Node's state whether to add it to the eviction queue.
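At its core, tryUpdateNodeHealth performs a timeout comparison: if the kubelet has not refreshed the Node status or its Lease within the grace period, the Ready Condition is set to Unknown. The sketch below only illustrates that idea; isNodeHealthStale, markReadyUnknown, and the probeTimestamp parameter are hypothetical names standing in for the data kept in nodeHealthMap, and the real code also updates the other Conditions and applies the longer nodeStartupGracePeriod to newly created Nodes.

package sketch

import (
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isNodeHealthStale (hypothetical helper) reports whether the node health signal is older
// than the grace period, in which case the controller would set the Node's Ready Condition
// to Unknown.
func isNodeHealthStale(probeTimestamp metav1.Time, gracePeriod time.Duration, now time.Time) bool {
    return now.After(probeTimestamp.Add(gracePeriod))
}

// markReadyUnknown returns the kind of Ready Condition the controller would write when the
// health signal is stale.
func markReadyUnknown(now metav1.Time) v1.NodeCondition {
    return v1.NodeCondition{
        Type:               v1.NodeReady,
        Status:             v1.ConditionUnknown,
        Reason:             "NodeStatusUnknown",
        Message:            "Kubelet stopped posting node status.",
        LastHeartbeatTime:  now,
        LastTransitionTime: now,
    }
}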
Evicting Pods from a Node using NoExecute Taints
If the Taint Manager is running, the nc.doNoExecuteTaintingPass goroutine adds a NoExecute Taint to unhealthy Nodes, and the nc.taintManager.Run goroutine then evicts the Pods that cannot tolerate these Taints.
nc.doNoExecuteTaintingPass:
- zoneNoExecuteTainterKeys: collects the names of all Zones.
- zoneNoExecuteTainterWorker.Try: adds the Taint for each Node of each Zone:
  - nc.nodeLister.Get: fetches the Node object.
  - GetNodeCondition: reads the Node's current Ready Condition.
  - switch condition.Status:
    - Ready Condition is false: add the node.kubernetes.io/not-ready Taint.
    - Ready Condition is unknown: add the node.kubernetes.io/unreachable Taint.
    - In both cases the effect is NoExecute.
  - SwapNodeControllerTaint: calls kube-apiserver to update the Taint.
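SwapNodeControllerTaint exists because the not-ready and unreachable taints are mutually exclusive: when the Ready Condition flips between false and unknown, the old taint has to be removed in the same pass that adds the new one. A minimal, hypothetical sketch of that swap on an in-memory copy of the taint list (the real helper patches the Node through kube-apiserver):

package sketch

import v1 "k8s.io/api/core/v1"

// swapNoExecuteTaint (hypothetical helper) drops oldKey and adds newKey in a copy of the
// Node's taints, mimicking the swap performed before the controller updates the Node.
func swapNoExecuteTaint(node *v1.Node, oldKey, newKey string) []v1.Taint {
    taints := make([]v1.Taint, 0, len(node.Spec.Taints)+1)
    for _, t := range node.Spec.Taints {
        if (t.Key == oldKey || t.Key == newKey) && t.Effect == v1.TaintEffectNoExecute {
            continue // drop the taint that no longer matches (and avoid duplicating newKey)
        }
        taints = append(taints, t)
    }
    taints = append(taints, v1.Taint{Key: newKey, Effect: v1.TaintEffectNoExecute})
    return taints
}

For example, swapNoExecuteTaint(node, "node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable") models the Ready Condition changing from false to unknown.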
nc.taintManager.Run:
- tc.nodeLister.Get: fetches the Node object.
- getNoExecuteTaints: collects all NoExecute Taints on the Node.
- tc.getPodsAssignedToNode: fetches the Pods on the Node.
- tc.processPodOnNode: evicts the Pods.
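For each Pod, processPodOnNode effectively answers two questions: does the Pod tolerate every NoExecute taint on the Node, and if so, for how long? The sketch below illustrates that decision only; tolerates and minTolerationSeconds are hypothetical helpers, and the real TaintManager schedules the eventual deletion through an internal timed worker queue instead of deleting inline.

package sketch

import v1 "k8s.io/api/core/v1"

// tolerates is a simplified check of whether a single toleration matches a taint.
func tolerates(tol v1.Toleration, taint v1.Taint) bool {
    if tol.Effect != "" && tol.Effect != taint.Effect {
        return false
    }
    if tol.Operator == v1.TolerationOpExists {
        return tol.Key == "" || tol.Key == taint.Key
    }
    return tol.Key == taint.Key && tol.Value == taint.Value
}

// minTolerationSeconds (hypothetical helper) reports whether the Pod tolerates every
// NoExecute taint and, if it does, the smallest tolerationSeconds among the matching
// tolerations (nil means the Pod is tolerated forever and is never evicted). If some
// taint is not tolerated, the TaintManager evicts the Pod immediately.
func minTolerationSeconds(pod *v1.Pod, taints []v1.Taint) (toleratesAll bool, seconds *int64) {
    for _, taint := range taints {
        var matched *v1.Toleration
        for i := range pod.Spec.Tolerations {
            if tolerates(pod.Spec.Tolerations[i], taint) {
                matched = &pod.Spec.Tolerations[i]
                break
            }
        }
        if matched == nil {
            return false, nil
        }
        if matched.TolerationSeconds != nil && (seconds == nil || *matched.TolerationSeconds < *seconds) {
            seconds = matched.TolerationSeconds
        }
    }
    return true, seconds
}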
Evicting Pods on a Node directly
If the Taint Manager is not running, the Pods on an unhealthy Node are evicted directly by nc.doEvictionPass:
- zonePodEvictorKeys: collects the names of all Zones.
- zonePodEvictionWorker.Try: evicts the Pods on each Node of each Zone:
  - nc.nodeLister.Get: fetches the Node object.
  - nc.getPodsAssignedToNode: fetches the Pods on the Node.
  - controllerutil.DeletePods: evicts the Pods.
  - nc.nodeEvictionMap.setStatus: sets the Node's eviction status to evicted.
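The direct path ultimately comes down to deleting, through the API server, every Pod bound to the Node that is not managed by a DaemonSet. The sketch below is a rough, hypothetical illustration using client-go; the real controllerutil.DeletePods also records events and reports whether any DaemonSet-managed Pods remain on the Node.

package sketch

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deletePodsOnNode (hypothetical helper) deletes the given Pods of an unhealthy Node,
// skipping DaemonSet-managed Pods, which would just be recreated on the same Node.
func deletePodsOnNode(ctx context.Context, client kubernetes.Interface, pods []*v1.Pod) error {
    for _, pod := range pods {
        if isDaemonSetPod(pod) {
            continue
        }
        if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
            return err
        }
    }
    return nil
}

// isDaemonSetPod checks the Pod's ownerReferences for a controlling DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
    for _, ref := range pod.OwnerReferences {
        if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
            return true
        }
    }
    return false
}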