K8s Core Resource Objects - Pod (QoS and Eviction Order)

Based on Kubernetes 1.25.

What Is QoS

The three QoS classes (a runnable sketch follows the list):

  • Guaranteed: the strictest class and the least likely to face eviction. These Pods are guaranteed not to be killed until they exceed their limits, or until there are no lower-priority Pods that can be preempted from the node.

    • Every container in the Pod must have both limits and requests
    • limits and requests must be equal
  • Burstable: the Pod gets some lower-bound guarantees based on its requests, without needing specific limits

    • If a limit is not specified, it effectively defaults to the node's capacity, which lets the Pod flexibly use more resources when they are available

    • When node resource pressure triggers eviction, Burstable Pods are evicted only after all BestEffort Pods have been evicted

    • Requires at least one container in the Pod to have a CPU or memory request or limit

  • BestEffort: if the node comes under resource pressure, the kubelet prefers to evict BestEffort Pods first

    • No container in the Pod has any limits or requests
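
As a quick illustration, the sketch below builds one single-container Pod per class and classifies it with the GetPodQOS helper walked through in the next section. The qos import path is an assumption taken from the Ref link further down (importing k8s.io/kubernetes directly usually needs replace directives in go.mod); the rest is standard API types.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	// Assumption: importing the helper from the kubernetes repo, per the Ref below.
	qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
)

// podWith builds a single-container Pod with the given requests and limits.
func podWith(requests, limits v1.ResourceList) *v1.Pod {
	return &v1.Pod{
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:      "app",
				Resources: v1.ResourceRequirements{Requests: requests, Limits: limits},
			}},
		},
	}
}

func main() {
	cpuMem := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("500m"),
		v1.ResourceMemory: resource.MustParse("256Mi"),
	}
	fmt.Println(qos.GetPodQOS(podWith(cpuMem, cpuMem))) // Guaranteed: requests == limits everywhere
	fmt.Println(qos.GetPodQOS(podWith(cpuMem, nil)))    // Burstable: requests set, no limits
	fmt.Println(qos.GetPodQOS(podWith(nil, nil)))       // BestEffort: nothing set
}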

Computing QoS

Containers that count toward QoS: regular containers and init containers; ephemeral containers are excluded.

The flow is as follows:

  • Append the regular containers and init containers to allContainers

  • Iterate over every participating container in the Pod and read its requests, filtering out resources that do not count toward QoS via the isSupportedQoSComputeResource func.

    • Only cpu and memory count; after aggregation they are appended to the ResourceList
  • limits are computed the same way as requests. In addition, every resource that has a limit configured is recorded in the qosLimitsFound set

  • If len(requests) == 0 && len(limits) == 0, no container has configured requests or limits, so the Pod is BestEffort

  • If qosLimitsFound does not contain both cpu and memory, the Pod cannot be Guaranteed, so isGuaranteed is set to false

  • If isGuaranteed is still true, i.e. requests and limits are both configured and every resource satisfies lim.Cmp(req) == 0 (requests equal limits), the Pod is Guaranteed

  • Anything that is neither BestEffort nor Guaranteed is Burstable

  • Ref:https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/apis/core/v1/helper/qos/qos.go#L39


// Source: pkg/apis/core/v1/helper/qos/qos.go (see the Ref above).
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/sets"
)

// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	allContainers = append(allContainers, pod.Spec.Containers...)
	allContainers = append(allContainers, pod.Spec.InitContainers...)
	for _, container := range allContainers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}

		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}
	return v1.PodQOSBurstable
}
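
For completeness, the isSupportedQoSComputeResource filter referenced in the flow above lives in the same file (as of the 1.25 source); it simply restricts QoS accounting to cpu and memory:

var supportedQoSComputeResources = sets.NewString(string(v1.ResourceCPU), string(v1.ResourceMemory))

// isSupportedQoSComputeResource checks whether the named resource counts toward QoS.
func isSupportedQoSComputeResource(name v1.ResourceName) bool {
	return supportedQoSComputeResources.Has(string(name))
}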

How QoS Affects kubelet Pod Eviction

The following factors determine the Pod eviction order:

  1. Whether the Pod's resource usage exceeds its requests
  2. The Pod's priority
  3. The Pod's resource usage above its requests
// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and
// finally by memory usage above requests.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods)
}
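
orderedBy chains comparators so that each later criterion only breaks ties left by the earlier ones. A minimal, self-contained sketch of that idea (the toy pod struct and comparator names are hypothetical, not the kubelet's actual types):

package main

import (
	"fmt"
	"sort"
)

// A toy stand-in for a Pod, holding just the fields the ranking inspects.
type pod struct {
	name            string
	exceedsRequests bool  // is memory usage above requests?
	priority        int32 // pod priority
	usageOverReq    int64 // memory usage above requests, in bytes
}

// cmpFunc returns <0 if a should be evicted before b, >0 for the reverse, 0 for a tie.
type cmpFunc func(a, b pod) int

// orderedBy chains comparators: each later one only breaks ties left by the
// earlier ones, which is how rankMemoryPressure composes its three criteria.
func orderedBy(cmps ...cmpFunc) func([]pod) {
	return func(pods []pod) {
		sort.SliceStable(pods, func(i, j int) bool {
			for _, cmp := range cmps {
				if r := cmp(pods[i], pods[j]); r != 0 {
					return r < 0
				}
			}
			return false
		})
	}
}

func main() {
	exceeds := func(a, b pod) int { // pods over their requests are evicted first
		switch {
		case a.exceedsRequests == b.exceedsRequests:
			return 0
		case a.exceedsRequests:
			return -1
		}
		return 1
	}
	byPriority := func(a, b pod) int { return int(a.priority) - int(b.priority) } // lower priority first
	byUsage := func(a, b pod) int { return int(b.usageOverReq - a.usageOverReq) } // larger overage first

	pods := []pod{
		{"a", false, 100, 0},
		{"b", true, 100, 10 << 20},
		{"c", true, 0, 5 << 20},
	}
	orderedBy(exceeds, byPriority, byUsage)(pods)
	fmt.Println(pods) // eviction order: c, b, a
}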

How QoS Affects the Linux OOM Killer

If the kubelet cannot evict Pods in time and the node's resources are exhausted, the Linux OOM killer on the node kicks in.

  • Using a scoring heuristic, it selectively kills processes so that the rest of the system can keep running

  • To pick a victim, it computes a score for every process, in the range 0~1000

    • The higher the score, the more likely the process is to be killed
    • A process's effective OOM score is oom_score + oom_score_adj; oom_score is derived from the process's memory consumption, while oom_score_adj is configurable and ranges from -1000 to 1000 (see the procfs sketch after this list)
  • The kubelet sets a different oom_score_adj depending on the Pod's QoS class

    • Guaranteed: -997, essentially the last to be killed by the OOM killer
    • BestEffort: 1000, the first to be killed by the OOM killer
    • Burstable: computed dynamically from the container's memory request (see the formula below), landing between the other two
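
You can observe these values directly on a node via procfs. A minimal sketch that inspects its own process; on a real node, substitute a container process's PID:

package main

import (
	"fmt"
	"os"
)

func main() {
	pid := os.Getpid() // inspect ourselves; use a container process's PID on a node
	for _, name := range []string{"oom_score", "oom_score_adj"} {
		data, err := os.ReadFile(fmt.Sprintf("/proc/%d/%s", pid, name))
		if err != nil {
			fmt.Println(err) // most likely not running on Linux procfs
			continue
		}
		fmt.Printf("%s = %s", name, data) // file contents end with a newline
	}
}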

Initial oom_score_adj definitions


const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj int = -999
	guaranteedOOMScoreAdj int = -997
	besteffortOOMScoreAdj int = 1000
)

Dynamic oom_score_adj computation


// GetContainerOOMScoreAdjust returns the amount by which the OOM score of all processes in the
// container should be adjusted.
// The OOM score of a process is the percentage of memory it consumes
// multiplied by 10 (barring exceptional cases) + a configurable quantity which is between -1000
// and 1000. Containers with higher OOM scores are killed if the system runs out of memory.
// See https://lwn.net/Articles/391222/ for more information.
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	if types.IsNodeCriticalPod(pod) {
		// Only node critical pod should be the last to get killed.
		return guaranteedOOMScoreAdj
	}

	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
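
To make the Burstable branch concrete, here is a worked example of the formula above under an assumed 10GiB node (the capacity is an illustration, not a value from the source): a container requesting 10% of memory lands at 900, 50% at 500, and 100% is clamped up to 3 (1000 + guaranteedOOMScoreAdj) so Burstable always scores above Guaranteed.

package main

import "fmt"

// Worked example of the Burstable formula:
//   oomScoreAdjust = 1000 - (1000 * memoryRequest) / memoryCapacity
// memoryCapacity is an assumed 10GiB node, not a value from the article.
func main() {
	memoryCapacity := int64(10 << 30)
	for _, memoryRequest := range []int64{1 << 30, 5 << 30, 10 << 30} {
		adj := 1000 - (1000*memoryRequest)/memoryCapacity
		if adj < 1000+(-997) { // clamp above Guaranteed's -997, as the real code does
			adj = 1000 + (-997)
		}
		if adj == 1000 { // stay strictly below BestEffort's 1000
			adj--
		}
		fmt.Printf("request=%dGi -> oom_score_adj=%d\n", memoryRequest>>30, adj)
	}
	// Output:
	// request=1Gi -> oom_score_adj=900
	// request=5Gi -> oom_score_adj=500
	// request=10Gi -> oom_score_adj=3
}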