K8s Core Resource Objects - Pod (QoS and Eviction Order)

Based on Kubernetes 1.25.

What Is QoS

The three QoS classes (a runnable sketch follows the list):

  • Guaranteed: the strictest class and the least likely to face eviction. These Pods are guaranteed not to be killed until they exceed their limits, or until there are no lower-priority Pods that can be preempted from the node.

    • Every container in the Pod must have both limits and requests
    • limits and requests must be equal
  • Burstable: the Pod gets some lower-bound guarantees based on its requests, without needing specific limits

    • If a limit is not specified, it effectively defaults to the node's capacity, which lets the Pod flexibly use more resources when they are available

    • When node resource pressure triggers eviction, Burstable Pods are evicted only after all BestEffort Pods have been evicted

    • Requires at least one container in the Pod to have a CPU or memory request or limit

  • BestEffort: if the node comes under resource pressure, the kubelet prefers to evict BestEffort Pods first

    • No container in the Pod has any limits or requests
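
As a quick illustration, the sketch below builds one single-container Pod per class and classifies it with the GetPodQOS helper walked through in the next section. The qos import path is an assumption taken from the Ref link further down (importing k8s.io/kubernetes directly usually needs replace directives in go.mod); the rest is standard API types.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	// Assumption: importing the helper from the kubernetes repo, per the Ref below.
	qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
)

// podWith builds a single-container Pod with the given requests and limits.
func podWith(requests, limits v1.ResourceList) *v1.Pod {
	return &v1.Pod{
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:      "app",
				Resources: v1.ResourceRequirements{Requests: requests, Limits: limits},
			}},
		},
	}
}

func main() {
	cpuMem := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("500m"),
		v1.ResourceMemory: resource.MustParse("256Mi"),
	}
	fmt.Println(qos.GetPodQOS(podWith(cpuMem, cpuMem))) // Guaranteed: requests == limits everywhere
	fmt.Println(qos.GetPodQOS(podWith(cpuMem, nil)))    // Burstable: requests set, no limits
	fmt.Println(qos.GetPodQOS(podWith(nil, nil)))       // BestEffort: nothing set
}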

Computing QoS

Containers that count toward QoS: regular containers and init containers; ephemeral containers are excluded.

The flow is as follows:

  • Append the regular containers and init containers to allContainers

  • Iterate over every participating container in the Pod and read its requests, filtering out resources that do not count toward QoS via the isSupportedQoSComputeResource func.

    • Only cpu and memory count; after aggregation they are appended to the ResourceList
  • limits are computed the same way as requests. In addition, every resource that has a limit configured is recorded in the qosLimitsFound set

  • If len(requests) == 0 && len(limits) == 0, no container has configured requests or limits, so the Pod is BestEffort

  • If qosLimitsFound does not contain both cpu and memory, the Pod cannot be Guaranteed, so isGuaranteed is set to false

  • If isGuaranteed is still true, i.e. requests and limits are both configured and every resource satisfies lim.Cmp(req) == 0 (requests equal limits), the Pod is Guaranteed

  • Anything that is neither BestEffort nor Guaranteed is Burstable

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/apis/core/v1/helper/qos/qos.go#L39


// Source: pkg/apis/core/v1/helper/qos/qos.go (see the Ref above).
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/sets"
)

// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	allContainers = append(allContainers, pod.Spec.Containers...)
	allContainers = append(allContainers, pod.Spec.InitContainers...)
	for _, container := range allContainers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}

		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}
	return v1.PodQOSBurstable
}
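
For completeness, the isSupportedQoSComputeResource filter referenced in the flow above lives in the same file (as of the 1.25 source); it simply restricts QoS accounting to cpu and memory:

var supportedQoSComputeResources = sets.NewString(string(v1.ResourceCPU), string(v1.ResourceMemory))

// isSupportedQoSComputeResource checks whether the named resource counts toward QoS.
func isSupportedQoSComputeResource(name v1.ResourceName) bool {
	return supportedQoSComputeResources.Has(string(name))
}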

How QoS Affects kubelet Pod Eviction

The following factors determine the Pod eviction order:

  1. Whether the Pod's resource usage exceeds its requests
  2. The Pod's priority
  3. The Pod's resource usage above its requests
// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and
// finally by memory usage above requests.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods)
}
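
orderedBy chains comparators so that each later criterion only breaks ties left by the earlier ones. A minimal, self-contained sketch of that idea (the toy pod struct and comparator names are hypothetical, not the kubelet's actual types):

package main

import (
	"fmt"
	"sort"
)

// A toy stand-in for a Pod, holding just the fields the ranking inspects.
type pod struct {
	name            string
	exceedsRequests bool  // is memory usage above requests?
	priority        int32 // pod priority
	usageOverReq    int64 // memory usage above requests, in bytes
}

// cmpFunc returns <0 if a should be evicted before b, >0 for the reverse, 0 for a tie.
type cmpFunc func(a, b pod) int

// orderedBy chains comparators: each later one only breaks ties left by the
// earlier ones, which is how rankMemoryPressure composes its three criteria.
func orderedBy(cmps ...cmpFunc) func([]pod) {
	return func(pods []pod) {
		sort.SliceStable(pods, func(i, j int) bool {
			for _, cmp := range cmps {
				if r := cmp(pods[i], pods[j]); r != 0 {
					return r < 0
				}
			}
			return false
		})
	}
}

func main() {
	exceeds := func(a, b pod) int { // pods over their requests are evicted first
		switch {
		case a.exceedsRequests == b.exceedsRequests:
			return 0
		case a.exceedsRequests:
			return -1
		}
		return 1
	}
	byPriority := func(a, b pod) int { return int(a.priority) - int(b.priority) } // lower priority first
	byUsage := func(a, b pod) int { return int(b.usageOverReq - a.usageOverReq) } // larger overage first

	pods := []pod{
		{"a", false, 100, 0},
		{"b", true, 100, 10 << 20},
		{"c", true, 0, 5 << 20},
	}
	orderedBy(exceeds, byPriority, byUsage)(pods)
	fmt.Println(pods) // eviction order: c, b, a
}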

How QoS Affects the Linux OOM Killer

If the kubelet cannot evict Pods in time and the node's resources are exhausted, the Linux OOM killer on the node kicks in.

  • Using a scoring heuristic, it selectively kills processes so that the rest of the system can keep running

  • To pick a victim, it computes a score for every process, in the range 0~1000

    • The higher the score, the more likely the process is to be killed
    • A process's effective OOM score is oom_score + oom_score_adj; oom_score is derived from the process's memory consumption, while oom_score_adj is configurable and ranges from -1000 to 1000 (see the procfs sketch after this list)
  • The kubelet sets a different oom_score_adj depending on the Pod's QoS class

    • Guaranteed: -997, essentially the last to be killed by the OOM killer
    • BestEffort: 1000, the first to be killed by the OOM killer
    • Burstable: computed dynamically from the container's memory request (see the formula below), landing between the other two
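
You can observe these values directly on a node via procfs. A minimal sketch that inspects its own process; on a real node, substitute a container process's PID:

package main

import (
	"fmt"
	"os"
)

func main() {
	pid := os.Getpid() // inspect ourselves; use a container process's PID on a node
	for _, name := range []string{"oom_score", "oom_score_adj"} {
		data, err := os.ReadFile(fmt.Sprintf("/proc/%d/%s", pid, name))
		if err != nil {
			fmt.Println(err) // most likely not running on Linux procfs
			continue
		}
		fmt.Printf("%s = %s", name, data) // file contents end with a newline
	}
}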

Initial oom_score_adj definitions


const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj int = -999
	guaranteedOOMScoreAdj int = -997
	besteffortOOMScoreAdj int = 1000
)

Dynamic oom_score_adj computation


// GetContainerOOMScoreAdjust returns the amount by which the OOM score of all processes in the
// container should be adjusted.
// The OOM score of a process is the percentage of memory it consumes
// multiplied by 10 (barring exceptional cases) + a configurable quantity which is between -1000
// and 1000. Containers with higher OOM scores are killed if the system runs out of memory.
// See https://lwn.net/Articles/391222/ for more information.
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	if types.IsNodeCriticalPod(pod) {
		// Only node critical pod should be the last to get killed.
		return guaranteedOOMScoreAdj
	}

	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
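
To make the Burstable branch concrete, here is a worked example of the formula above under an assumed 10GiB node (the capacity is an illustration, not a value from the source): a container requesting 10% of memory lands at 900, 50% at 500, and 100% is clamped up to 3 (1000 + guaranteedOOMScoreAdj) so Burstable always scores above Guaranteed.

package main

import "fmt"

// Worked example of the Burstable formula:
//   oomScoreAdjust = 1000 - (1000 * memoryRequest) / memoryCapacity
// memoryCapacity is an assumed 10GiB node, not a value from the article.
func main() {
	memoryCapacity := int64(10 << 30)
	for _, memoryRequest := range []int64{1 << 30, 5 << 30, 10 << 30} {
		adj := 1000 - (1000*memoryRequest)/memoryCapacity
		if adj < 1000+(-997) { // clamp above Guaranteed's -997, as the real code does
			adj = 1000 + (-997)
		}
		if adj == 1000 { // stay strictly below BestEffort's 1000
			adj--
		}
		fmt.Printf("request=%dGi -> oom_score_adj=%d\n", memoryRequest>>30, adj)
	}
	// Output:
	// request=1Gi -> oom_score_adj=900
	// request=5Gi -> oom_score_adj=500
	// request=10Gi -> oom_score_adj=3
}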