K8s kubelet: cgroup resource isolation and garbage collection internals

Based on Kubernetes 1.25

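What is cgroup resource isolation

kubelet uses cgroups to limit the resources a Pod can consume. cgroups (control groups) are a Linux kernel feature for limiting, accounting for, and isolating the resource usage (CPU, memory, disk I/O) of a group of processes.

When kubelet creates a Pod, it passes the cgroup parent directory it has configured to the container runtime, so every process the runtime creates for that Pod is confined under the parent cgroup that kubelet set up.

  • kubelet maintains the cgroup configuration at the Pod, QoS, and Node levels
  • Container-level cgroups are left entirely to the container runtime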

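cgroup hierarchy

kubelet organizes cgroups in a four-level hierarchy:

  1. Node Level cgroup

    To keep the system stable, kubelet can reserve resources for system daemons, so that Pods cannot consume all of the node's resources and hang or crash the system.

    By default, kube-reserved and system-reserved are not enabled.

    Once they are enforced, the daemons themselves are placed under a cgroup limit. If that limit is set too low, the daemons can run out of resources and exit.

    In practice, therefore, it is common to configure only the kube-reserved and system-reserved reservations, which lowers the resource ceiling available to Pods, without adding kube-reserved and system-reserved to enforceNodeAllocatable.

  2. QoS Level cgroup

    Pods in K8s fall into three QoS classes: Guaranteed, Burstable, and BestEffort. kubelet manages Pods by class in the cgroup tree: Burstable and BestEffort each get a dedicated QoS cgroup, while Guaranteed Pods are placed directly under the Pod root cgroup.

  3. Pod Level cgroup

    Before starting a Pod, kubelet computes the Pod's resource limits and configures a Pod-level cgroup with those limits.

  4. Container Level cgroup

    The container-level cgroup is created by the container runtime, not by kubelet. When kubelet asks the runtime through CRI to create the Pod, it passes along the cgroup path it has prepared.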
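To make the four levels concrete, here is a minimal sketch of how a Pod-level cgroup path could be assembled. It assumes the cgroupfs driver and the default root name kubepods; the helper podCgroupPath and the QoS type are hypothetical names for this example, not kubelet's actual API, and the systemd driver uses a different naming scheme.

```go
package main

import (
	"fmt"
	"path"
)

// QoS is a hypothetical stand-in for the Pod QoS class.
type QoS string

const (
	Guaranteed QoS = "Guaranteed"
	Burstable  QoS = "Burstable"
	BestEffort QoS = "BestEffort"
)

// podCgroupPath builds the Pod-level cgroup path below the node-level root.
// Assumption: cgroupfs-style naming, where Burstable and BestEffort Pods sit
// under a QoS cgroup and Guaranteed Pods sit directly under the root.
func podCgroupPath(root string, qos QoS, podUID string) string {
	switch qos {
	case Burstable:
		return path.Join("/", root, "burstable", "pod"+podUID)
	case BestEffort:
		return path.Join("/", root, "besteffort", "pod"+podUID)
	default:
		return path.Join("/", root, "pod"+podUID)
	}
}

func main() {
	fmt.Println(podCgroupPath("kubepods", Burstable, "1234"))  // /kubepods/burstable/pod1234
	fmt.Println(podCgroupPath("kubepods", Guaranteed, "5678")) // /kubepods/pod5678
}
```

The container runtime then creates each container's own cgroup below the path it receives, which is the fourth level.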

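Image garbage collection

Overview

In K8s, every node keeps an image cache that stores the container images used on that node. These images take up disk space, and if they are not cleaned up in time the node runs out of disk.

  • kubelet introduces ImageGCManager, which periodically scans the images cached on the node and removes those that are no longer needed

Starting ImageGCManager takes three steps:

  1. Instantiate the ImageGCManager object
  2. Start the GarbageCollect goroutine
  3. Start the image detection goroutine and the image cache refresh goroutine

Instantiating the ImageGCManager object

While createAndInitKubelet builds the Kubelet, the ImageGCManager is instantiated through the NewImageGCManager function.

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/images/image_gc_manager.go#L155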

    // NewImageGCManager instantiates a new ImageGCManager object.
    func NewImageGCManager(runtime container.Runtime, statsProvider StatsProvider, recorder record.EventRecorder, nodeRef *v1.ObjectReference, policy ImageGCPolicy, sandboxImage string) (ImageGCManager, error) {
        // Validate policy.
        if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
            return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
        }
        if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
            return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
        }
        if policy.LowThresholdPercent > policy.HighThresholdPercent {
            return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
        }
        im := &realImageGCManager{
            // Container runtime interface used by kubelet. Image GC uses it to list
            // and look up images and to perform the actual removal.
            runtime: runtime,
            // Image GC policy: the high and low disk-usage thresholds and the minimum
            // image age, which together control when image GC is triggered.
            policy:       policy,
            imageRecords: make(map[string]*imageRecord),
            // Stats provider that reports disk usage of the image filesystem, the
            // input data for collection decisions.
            statsProvider: statsProvider,
            // Event recorder, used to emit events for notable warnings during image
            // GC, such as an invalid disk capacity.
            recorder: recorder,
            // Object reference pointing to the node this kubelet runs on. It is the
            // object that emitted events are attached to.
            nodeRef:     nodeRef,
            initialized: false,
            // Name of the sandbox image. It is always treated as in use and is never collected.
            sandboxImage: sandboxImage,
        }

        return im, nil
    }

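Starting the GarbageCollect goroutine

After the ImageGCManager has been initialized, kubelet.StartGarbageCollection starts one collection goroutine for containers and one for images.

  • The image goroutine runs image garbage collection once every ImageGCPeriod (5 minutes)

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kubelet.go#L1310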

    // Excerpt from StartGarbageCollection: the image GC loop.
    // prevImageGCFailed is declared just before this loop in the full function.
    go wait.Until(func() {
        if err := kl.imageManager.GarbageCollect(); err != nil {
            if prevImageGCFailed {
                klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
                // Only create an event for repeated failures
                kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
            } else {
                klog.ErrorS(err, "Image garbage collection failed once. Stats initialization may not have completed yet")
            }
            prevImageGCFailed = true
        } else {
            var vLevel klog.Level = 4
            if prevImageGCFailed {
                vLevel = 1
                prevImageGCFailed = false
            }

            klog.V(vLevel).InfoS("Image garbage collection succeeded")
        }
    }, ImageGCPeriod, wait.NeverStop)

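Starting the image detection and image cache refresh goroutines

While kubelet.Run starts the kubelet's internal modules, it calls ImageGCManager.Start, which launches periodic image detection and periodic image cache refresh in two separate goroutines.

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/images/image_gc_manager.go#L180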

    func (im *realImageGCManager) Start() {
        go wait.Until(func() {
            // Initial detection make detected time "unknown" in the past.
            var ts time.Time
            if im.initialized {
                ts = time.Now()
            }
            _, err := im.detectImages(ts)
            if err != nil {
                klog.InfoS("Failed to monitor images", "err", err)
            } else {
                im.initialized = true
            }
        }, 5*time.Minute, wait.NeverStop)

        // Start a goroutine periodically updates image cache.
        go wait.Until(func() {
            images, err := im.runtime.ListImages()
            if err != nil {
                klog.InfoS("Failed to update image list", "err", err)
            } else {
                im.imageCache.set(images)
            }
        }, 30*time.Second, wait.NeverStop)

    }
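
  • Start launches two independent goroutines that periodically perform the following tasks

    • im.detectImages: every 5 minutes, scans the node through CRI to find all images and Pods and updates ImageGCManager's image list so that it matches the images actually present on the node. Images still used by a Pod are collected into the imagesInUse set. im.imageRecords serves as the complete image list for each round of image garbage collection
    • im.runtime.ListImages: every 30 seconds, refreshes the image cache with the list of images present on the node. This cache is what is returned as the node's complete image list when the node status is queried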
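Both goroutines follow the same shape: run a task, wait for a fixed period, repeat until told to stop. The sketch below reproduces that shape with the standard library only. The function until is a simplified, hypothetical stand-in for wait.Until (it ignores jitter and panic handling), and runThreeTimes exists only to show the loop terminating.

```go
package main

import (
	"fmt"
	"time"
)

// until calls f immediately and then again after each period, until stop is
// closed. Simplified stand-in for wait.Until; not the real implementation.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Check for shutdown before every run.
		select {
		case <-stop:
			return
		default:
		}
		f()
		// Wait for the next period, or return early on shutdown.
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// runThreeTimes drives until with a task that closes the stop channel on its
// third run, and reports how many times the task ran.
func runThreeTimes() int {
	stop := make(chan struct{})
	runs := 0
	until(func() {
		runs++
		if runs == 3 {
			close(stop)
		}
	}, time.Millisecond, stop)
	return runs
}

func main() {
	fmt.Println("task ran", runThreeTimes(), "times")
}
```

kubelet passes wait.NeverStop as the stop channel, so its two loops run for the lifetime of the process.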

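How image garbage collection works

There are two main steps:

  1. GarbageCollect: collect disk statistics for the image filesystem, compute how much disk space has to be reclaimed, and start image garbage collection
  2. freeSpace: find the images on the node that can be collected and remove them one by one, until enough disk space has been freed or every candidate has been removed

GarbageCollect: collect disk statistics for the image filesystem, compute how much disk space has to be reclaimed, and start image garbage collection

  • To avoid running image garbage collection too often, the ImageGCManager policy object defines a high disk-usage threshold that triggers collection (HighThresholdPercent, default 85%) and a low threshold (LowThresholdPercent, default 80%). Both can be set when starting kubelet with the --image-gc-high-threshold and --image-gc-low-threshold flags. Collection starts only when disk usage reaches HighThresholdPercent, and it tries to free enough space to bring usage down to LowThresholdPercent

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/images/image_gc_manager.go#L279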

    func (im *realImageGCManager) GarbageCollect() error {
        // Get disk usage on disk holding images.
        fsStats, err := im.statsProvider.ImageFsStats()
        if err != nil {
            return err
        }

        var capacity, available int64
        if fsStats.CapacityBytes != nil {
            capacity = int64(*fsStats.CapacityBytes)
        }
        if fsStats.AvailableBytes != nil {
            available = int64(*fsStats.AvailableBytes)
        }

        if available > capacity {
            klog.InfoS("Availability is larger than capacity", "available", available, "capacity", capacity)
            available = capacity
        }

        // Check valid capacity.
        if capacity == 0 {
            err := goerrors.New("invalid capacity 0 on image filesystem")
            im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
            return err
        }

        // If over the max threshold, free enough to place us at the lower threshold.
        usagePercent := 100 - int(available*100/capacity)
        if usagePercent >= im.policy.HighThresholdPercent {
            amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
            klog.InfoS("Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold", "usage", usagePercent, "highThreshold", im.policy.HighThresholdPercent, "amountToFree", amountToFree, "lowThreshold", im.policy.LowThresholdPercent)
            freed, err := im.freeSpace(amountToFree, time.Now())
            if err != nil {
                return err
            }

            if freed < amountToFree {
                err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
                im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
                return err
            }
        }

        return nil
    }
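The threshold arithmetic is easier to follow with numbers. The sketch below lifts the two formulas from GarbageCollect into a standalone helper; amountToFree is a hypothetical function name for this example, and it assumes capacity is non-zero (the real code rejects a zero capacity before reaching this point).

```go
package main

import "fmt"

// amountToFree returns how many bytes image GC would try to reclaim, or 0 when
// disk usage is still below the high threshold. Assumes capacity > 0.
func amountToFree(capacity, available int64, highPercent, lowPercent int) int64 {
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent < highPercent {
		return 0
	}
	// Free enough so that the free space reaches (100 - lowPercent)% of capacity.
	return capacity*int64(100-lowPercent)/100 - available
}

func main() {
	const gib = int64(1) << 30
	// 100 GiB disk with 10 GiB free: usage is 90%, above the 85% threshold.
	fmt.Println(amountToFree(100*gib, 10*gib, 85, 80) / gib) // 10
	// 100 GiB disk with 30 GiB free: usage is 70%, nothing to do.
	fmt.Println(amountToFree(100*gib, 30*gib, 85, 80)) // 0
}
```

With the default thresholds, a 100 GiB image filesystem that is 90% full asks freeSpace for 10 GiB, which would bring usage back to 80%.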

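freeSpace: find the images on the node that can be collected and remove them one by one, until enough disk space has been freed or every candidate has been removed

    1. First, collect all images in use (imagesInUse): the sandbox image (set with kubelet's --pod-infra-container-image flag) and the images of the Pods on the node
    2. Use runtime.ListImages to get every image on the node and sync them into imageRecords
      • For a newly discovered image, set its first-detected time to the detection time detectTime
      • For an image in use, set its last-used time to the current time
      • Remove from imageRecords any image that no longer exists on the node
    3. Walk the unused images from least recently used to most recently used. If an image's last-used time is earlier than the start of this collection round, and the time since it was first detected exceeds the minimum image age (kubelet's --minimum-image-ttl-duration, default 2m), remove it from the node with runtime.RemoveImage and delete its imageRecords entry
    • Ref (detectImages, which freeSpace calls to perform steps 1 and 2): https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/images/image_gc_manager.go#L212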

      func (im *realImageGCManager) detectImages(detectTime time.Time) (sets.String, error) {
          imagesInUse := sets.NewString()

          // Always consider the container runtime pod sandbox image in use
          imageRef, err := im.runtime.GetImageRef(container.ImageSpec{Image: im.sandboxImage})
          if err == nil && imageRef != "" {
              imagesInUse.Insert(imageRef)
          }

          images, err := im.runtime.ListImages()
          if err != nil {
              return imagesInUse, err
          }
          pods, err := im.runtime.GetPods(true)
          if err != nil {
              return imagesInUse, err
          }

          // Make a set of images in use by containers.
          for _, pod := range pods {
              for _, container := range pod.Containers {
                  klog.V(5).InfoS("Container uses image", "pod", klog.KRef(pod.Namespace, pod.Name), "containerName", container.Name, "containerImage", container.Image, "imageID", container.ImageID)
                  imagesInUse.Insert(container.ImageID)
              }
          }

          // Add new images and record those being used.
          now := time.Now()
          currentImages := sets.NewString()
          im.imageRecordsLock.Lock()
          defer im.imageRecordsLock.Unlock()
          for _, image := range images {
              klog.V(5).InfoS("Adding image ID to currentImages", "imageID", image.ID)
              currentImages.Insert(image.ID)

              // New image, set it as detected now.
              if _, ok := im.imageRecords[image.ID]; !ok {
                  klog.V(5).InfoS("Image ID is new", "imageID", image.ID)
                  im.imageRecords[image.ID] = &imageRecord{
                      firstDetected: detectTime,
                  }
              }

              // Set last used time to now if the image is being used.
              if isImageUsed(image.ID, imagesInUse) {
                  klog.V(5).InfoS("Setting Image ID lastUsed", "imageID", image.ID, "lastUsed", now)
                  im.imageRecords[image.ID].lastUsed = now
              }

              klog.V(5).InfoS("Image ID has size", "imageID", image.ID, "size", image.Size)
              im.imageRecords[image.ID].size = image.Size

              klog.V(5).InfoS("Image ID is pinned", "imageID", image.ID, "pinned", image.Pinned)
              im.imageRecords[image.ID].pinned = image.Pinned
          }

          // Remove old images from our records.
          for image := range im.imageRecords {
              if !currentImages.Has(image) {
                  klog.V(5).InfoS("Image ID is no longer present; removing from imageRecords", "imageID", image)
                  delete(im.imageRecords, image)
              }
          }

          return imagesInUse, nil
      }
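The listing above covers detection only. The selection that freeSpace performs afterwards (step 3) can be sketched as follows. This is a simplified illustration, not the kubelet source: the record type, pickForRemoval, and demo are hypothetical names, and the sketch only decides which images would be removed instead of calling the runtime.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// record is a simplified, hypothetical version of kubelet's image record.
type record struct {
	id            string
	firstDetected time.Time
	lastUsed      time.Time
	size          int64
	inUse, pinned bool
}

// pickForRemoval returns the IDs of the images to delete, least recently used
// first, and stops as soon as bytesToFree is covered or candidates run out.
func pickForRemoval(records []record, bytesToFree int64, freeTime time.Time, minAge time.Duration) ([]string, int64) {
	candidates := make([]record, 0, len(records))
	for _, r := range records {
		if r.inUse || r.pinned {
			continue // never collect images that are in use or pinned
		}
		candidates = append(candidates, r)
	}
	// Order by last-used time, breaking ties by first-detected time.
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].lastUsed.Equal(candidates[j].lastUsed) {
			return candidates[i].firstDetected.Before(candidates[j].firstDetected)
		}
		return candidates[i].lastUsed.Before(candidates[j].lastUsed)
	})

	var removed []string
	var freed int64
	for _, r := range candidates {
		if !r.lastUsed.Before(freeTime) {
			continue // used during this round
		}
		if freeTime.Sub(r.firstDetected) < minAge {
			continue // not old enough yet
		}
		removed = append(removed, r.id)
		freed += r.size
		if freed >= bytesToFree {
			break
		}
	}
	return removed, freed
}

// demo runs the selection on a small fixed data set.
func demo() ([]string, int64) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	records := []record{
		{id: "old", firstDetected: now.Add(-48 * time.Hour), lastUsed: now.Add(-24 * time.Hour), size: 300},
		{id: "recent", firstDetected: now.Add(-48 * time.Hour), lastUsed: now.Add(-time.Hour), size: 500},
		{id: "fresh", firstDetected: now.Add(-time.Minute), size: 400},
		{id: "busy", firstDetected: now.Add(-48 * time.Hour), lastUsed: now, size: 900, inUse: true},
	}
	return pickForRemoval(records, 600, now, 2*time.Minute)
}

func main() {
	ids, freed := demo()
	fmt.Println(ids, freed) // [old recent] 800
}
```

In the demo, "fresh" is skipped because it was detected less than two minutes ago, and "busy" is skipped because it is in use, so the sketch selects "old" and then "recent" before it has covered the requested 600 bytes.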

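Container garbage collection

Overview

In K8s, every container consumes system resources. If dead containers are not cleaned up promptly, they keep holding on to those resources.

  • kubelet starts a dedicated goroutine that periodically (every minute by default) scans for and removes garbage containers on the node.

Container garbage collection is implemented by ContainerGC.

Main flow:

  1. Initialize ContainerGC's dependencies: the container GC policy GCPolicy, the runtime manager KubeGenericRuntimeManager that provides container queries and operations, and the kubelet config source readiness checker SourcesReadyProvider
  2. Instantiate the ContainerGC object
  3. Start the container garbage collection goroutine

Initializing ContainerGC's dependencies

  1. Initialize the container GC policy GCPolicy

    The container GC policy is initialized in NewMainKubelet, the function that constructs the kubelet.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/container/container_gc.go#L27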

      // GCPolicy specifies a policy for garbage collecting containers.
      type GCPolicy struct {
          // Minimum age at which a container can be garbage collected, zero for no limit.
          // A non-running container must be at least this old before it can be collected.
          // Set by the --minimum-container-ttl-duration flag, default 0.
          MinAge time.Duration

          // Max number of dead containers any single pod (UID, container name) pair is
          // allowed to have, less than zero for no limit.
          // Dead containers beyond this count are removed, oldest first.
          // Set by the --maximum-dead-containers-per-container flag, default 1.
          MaxPerPodContainer int

          // Max number of total dead containers, less than zero for no limit.
          // This is the node-wide cap on dead containers.
          // Set by the --maximum-dead-containers flag, default -1.
          MaxContainers int
      }
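  2. Initialize the generic container runtime manager KubeGenericRuntimeManager

    KubeGenericRuntimeManager is the default implementation of the Runtime interface. It wraps the common operations on Pods and containers and manages the lifecycle of Pod containers. Underneath, it drives the container runtime through CRI by way of the remote runtime service.

  3. Initialize the kubelet config source readiness checker SourcesReadyProvider

    The SourcesReadyProvider interface reports whether the config sources are ready. In the container garbage collector it supplies the readiness of kubelet's config sources: its AllReady function tells whether all config sources are ready, and the collector applies a different collection strategy depending on the answer.

    The concrete instance behind SourcesReadyProvider is sourcesImpl, which uses the SeenAllSources function to decide whether all config sources (the possible sources are file, HTTP, and kube-apiserver) are ready.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/config/config.go#L94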

      // SeenAllSources returns true if seenSources contains all sources in the
      // config, and also this config has received a SET message from each source.
      func (c *PodConfig) SeenAllSources(seenSources sets.String) bool {
          if c.pods == nil {
              return false
          }
          c.sourcesLock.Lock()
          defer c.sourcesLock.Unlock()
          klog.V(5).InfoS("Looking for sources, have seen", "sources", c.sources.List(), "seenSources", seenSources)
          return seenSources.HasAll(c.sources.List()...) && c.pods.seenSources(c.sources.List()...)
      }
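At its core this readiness check is a set-containment test: every configured source must already have been seen. A minimal sketch with plain maps, where allSourcesSeen is a hypothetical helper and the source names are illustrative values for this example:

```go
package main

import "fmt"

// allSourcesSeen reports whether every configured source appears in seen.
// With no configured sources it returns true.
func allSourcesSeen(configured []string, seen map[string]bool) bool {
	for _, s := range configured {
		if !seen[s] {
			return false
		}
	}
	return true
}

func main() {
	configured := []string{"file", "http", "api"}
	// The HTTP source has not reported yet, so the sources are not all ready.
	fmt.Println(allSourcesSeen(configured, map[string]bool{"file": true, "api": true}))
	// All three sources have reported.
	fmt.Println(allSourcesSeen(configured, map[string]bool{"file": true, "http": true, "api": true}))
}
```

Until every source has reported, the collector cannot tell whether a Pod is truly gone or simply not yet known, which is why the Pod-based cleanup steps are gated on this check.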

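Instantiating the ContainerGC object

The container garbage collector is implemented by realContainerGC. It performs the collection through the generic runtime manager KubeGenericRuntimeManager, and it uses the GCPolicy policy and the SourcesReadyProvider config source readiness checker as its sources of policy and state.

Starting the container garbage collection goroutine

The core function of realContainerGC is GarbageCollect. When kubelet starts the collection goroutines in StartGarbageCollection, it starts both the image GC goroutine and the container GC goroutine.

  • The container GC goroutine runs periodically (every minute by default).

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kubelet.go#L1285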

    // StartGarbageCollection starts garbage collection threads.
    func (kl *Kubelet) StartGarbageCollection() {
        loggedContainerGCFailure := false
        go wait.Until(func() {
            if err := kl.containerGC.GarbageCollect(); err != nil {
                klog.ErrorS(err, "Container garbage collection failed")
                kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ContainerGCFailed, err.Error())
                loggedContainerGCFailure = true
            } else {
                var vLevel klog.Level = 4
                if loggedContainerGCFailure {
                    vLevel = 1
                    loggedContainerGCFailure = false
                }

                klog.V(vLevel).InfoS("Container garbage collection succeeded")
            }
        }, ContainerGCPeriod, wait.NeverStop)

        // when the high threshold is set to 100, stub the image GC manager
        if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
            klog.V(2).InfoS("ImageGCHighThresholdPercent is set 100, Disable image GC")
            return
        }

        prevImageGCFailed := false
        go wait.Until(func() {
            if err := kl.imageManager.GarbageCollect(); err != nil {
                if prevImageGCFailed {
                    klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
                    // Only create an event for repeated failures
                    kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
                } else {
                    klog.ErrorS(err, "Image garbage collection failed once. Stats initialization may not have completed yet")
                }
                prevImageGCFailed = true
            } else {
                var vLevel klog.Level = 4
                if prevImageGCFailed {
                    vLevel = 1
                    prevImageGCFailed = false
                }

                klog.V(vLevel).InfoS("Image garbage collection succeeded")
            }
        }, ImageGCPeriod, wait.NeverStop)
    }

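ContainerGC performs the collection in its GarbageCollect function. Depending on the GC policy and on whether the Pod config sources are ready, it completes container garbage collection in three steps:

  • Remove unused containers (evictContainers)
  • Remove unused sandboxes (evictSandboxes)
  • Remove stale log directories (evictPodLogsDirectories)

  1. Remove unused containers (evictContainers)

    ContainerGC walks the containers on the node to find every container that can be removed, and groups them by the combination of Pod UID and container name.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L186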

      // evictableContainers gets all containers that are evictable. Evictable containers are: not running
      // and created more than MinAge ago.
      func (cgc *containerGC) evictableContainers(minAge time.Duration) (containersByEvictUnit, error) {
          containers, err := cgc.manager.getKubeletContainers(true)
          if err != nil {
              return containersByEvictUnit{}, err
          }

          evictUnits := make(containersByEvictUnit)
          newestGCTime := time.Now().Add(-minAge)
          for _, container := range containers {
              // Prune out running containers.
              if container.State == runtimeapi.ContainerState_CONTAINER_RUNNING {
                  continue
              }

              createdAt := time.Unix(0, container.CreatedAt)
              if newestGCTime.Before(createdAt) {
                  continue
              }

              labeledInfo := getContainerInfoFromLabels(container.Labels)
              containerInfo := containerGCInfo{
                  id:         container.Id,
                  name:       container.Metadata.Name,
                  createTime: createdAt,
                  unknown:    container.State == runtimeapi.ContainerState_CONTAINER_UNKNOWN,
              }
              key := evictUnit{
                  uid:  labeledInfo.PodUID,
                  name: containerInfo.name,
              }
              evictUnits[key] = append(evictUnits[key], containerInfo)
          }

          return evictUnits, nil
      }

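    The evictableContainers function gets all containers on the node through getKubeletContainers, then excludes the containers that match either of the conditions below. The remaining containers form the list of removal candidates:

    • Running containers, i.e. container.State == runtimeapi.ContainerState_CONTAINER_RUNNING
    • Recently created containers, i.e. containers whose age (time.Now() minus container.CreatedAt) is less than the minimum age GCPolicy.MinAge. (The policy's default MinAge is 0, so with default kubelet settings this check excludes nothing.)

    The candidates are grouped by evictUnit (the combination of Pod UID and container name) and removed in the evictContainers function.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L223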

      // evict all containers that are evictable
      func (cgc *containerGC) evictContainers(gcPolicy kubecontainer.GCPolicy, allSourcesReady bool, evictNonDeletedPods bool) error {
          // Separate containers by evict units.
          evictUnits, err := cgc.evictableContainers(gcPolicy.MinAge)
          if err != nil {
              return err
          }

          // Remove deleted pod containers if all sources are ready.
          if allSourcesReady {
              for key, unit := range evictUnits {
                  if cgc.podStateProvider.ShouldPodContentBeRemoved(key.uid) || (evictNonDeletedPods && cgc.podStateProvider.ShouldPodRuntimeBeRemoved(key.uid)) {
                      cgc.removeOldestN(unit, len(unit)) // Remove all.
                      delete(evictUnits, key)
                  }
              }
          }

          // Enforce max containers per evict unit.
          if gcPolicy.MaxPerPodContainer >= 0 {
              cgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)
          }

          // Enforce max total number of containers.
          if gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {
              // Leave an equal number of containers per evict unit (min: 1).
              numContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()
              if numContainersPerEvictUnit < 1 {
                  numContainersPerEvictUnit = 1
              }
              cgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)

              // If we still need to evict, evict oldest first.
              numContainers := evictUnits.NumContainers()
              if numContainers > gcPolicy.MaxContainers {
                  flattened := make([]containerGCInfo, 0, numContainers)
                  for key := range evictUnits {
                      flattened = append(flattened, evictUnits[key]...)
                  }
                  sort.Sort(byCreated(flattened))

                  cgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)
              }
          }
          return nil
      }

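The evictContainers function applies the following cleanup steps to the candidates in order, until enough containers have been removed:

  • If all kubelet config sources are ready (allSourcesReady), remove every container in the evict units that belong to Pods which have been evicted or deleted and have already terminated
  • If the GCPolicy sets a per-evictUnit cap on non-running containers (MaxPerPodContainer >= 0), walk the evict units, keep only the newest MaxPerPodContainer non-running containers in each, and remove the older ones
  • If the GCPolicy sets a node-wide cap on dead containers (MaxContainers >= 0) and the cap is still exceeded, first limit each evict unit to an equal share of the cap (at least 1), then, if there are still too many, sort all remaining containers by creation time and remove the oldest until no more than MaxContainers dead containers are left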
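The two caps can be sketched on plain data. This is an illustration under stated assumptions, not the kubelet code: deadContainer, unitKey, keepNewest, enforce, and demoUnits are hypothetical names, the Pod-deletion step is left out, and the sketch returns the IDs it would remove (sorted for stable output) instead of calling the runtime.

```go
package main

import (
	"fmt"
	"sort"
)

// deadContainer is a hypothetical, simplified record of a non-running container.
type deadContainer struct {
	id      string
	created int64 // creation time as a Unix timestamp
}

// unitKey plays the role of evictUnit: one (Pod UID, container name) pair.
type unitKey struct{ podUID, name string }

// keepNewest keeps the n newest containers and returns the IDs of the rest.
func keepNewest(cs []deadContainer, n int) (kept []deadContainer, removed []string) {
	sort.Slice(cs, func(i, j int) bool { return cs[i].created > cs[j].created })
	if n > len(cs) {
		n = len(cs)
	}
	for _, c := range cs[n:] {
		removed = append(removed, c.id)
	}
	return cs[:n], removed
}

// enforce applies the per-unit cap, then the node-wide cap, and returns the
// sorted IDs of the containers that would be removed. Negative caps are ignored.
func enforce(units map[unitKey][]deadContainer, maxPerUnit, maxTotal int) []string {
	var removed []string
	capUnits := func(n int) {
		for k, cs := range units {
			kept, gone := keepNewest(cs, n)
			units[k] = kept
			removed = append(removed, gone...)
		}
	}
	total := func() int {
		n := 0
		for _, cs := range units {
			n += len(cs)
		}
		return n
	}

	if maxPerUnit >= 0 {
		capUnits(maxPerUnit)
	}
	if maxTotal >= 0 && total() > maxTotal {
		// Give every unit an equal share of the cap, but at least one container.
		share := maxTotal / len(units)
		if share < 1 {
			share = 1
		}
		capUnits(share)
		// Still over the cap: drop the oldest containers across all units.
		if n := total(); n > maxTotal {
			var flat []deadContainer
			for _, cs := range units {
				flat = append(flat, cs...)
			}
			sort.Slice(flat, func(i, j int) bool { return flat[i].created < flat[j].created })
			for _, c := range flat[:n-maxTotal] {
				removed = append(removed, c.id)
			}
		}
	}
	sort.Strings(removed)
	return removed
}

// demoUnits returns a fresh data set, since enforce modifies the map it is given.
func demoUnits() map[unitKey][]deadContainer {
	return map[unitKey][]deadContainer{
		{"pod-a", "app"}: {{"a1", 10}, {"a2", 20}, {"a3", 30}},
		{"pod-b", "app"}: {{"b1", 15}},
	}
}

func main() {
	fmt.Println(enforce(demoUnits(), 2, 2))  // [a1 a2]
	fmt.Println(enforce(demoUnits(), -1, 1)) // [a1 a2 b1]
}
```

In the second call the equal-share pass leaves one container per unit, which is still above the cap of 1, so the oldest survivor (b1) is removed as well.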
  2. Remove unused sandboxes (evictSandboxes)

    Once the necessary container cleanup is done, ContainerGC removes the sandboxes that are no longer needed.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L275

      // evictSandboxes remove all evictable sandboxes. An evictable sandbox must
      // meet the following requirements:
      //  1. not in ready state
      //  2. contains no containers.
      //  3. belong to a non-existent (i.e., already removed) pod, or is not the
      //     most recently created sandbox for the pod.
      func (cgc *containerGC) evictSandboxes(evictNonDeletedPods bool) error {
          containers, err := cgc.manager.getKubeletContainers(true)
          if err != nil {
              return err
          }

          sandboxes, err := cgc.manager.getKubeletSandboxes(true)
          if err != nil {
              return err
          }

          // collect all the PodSandboxId of container
          sandboxIDs := sets.NewString()
          for _, container := range containers {
              sandboxIDs.Insert(container.PodSandboxId)
          }

          sandboxesByPod := make(sandboxesByPodUID)
          for _, sandbox := range sandboxes {
              podUID := types.UID(sandbox.Metadata.Uid)
              sandboxInfo := sandboxGCInfo{
                  id:         sandbox.Id,
                  createTime: time.Unix(0, sandbox.CreatedAt),
              }

              // Set ready sandboxes to be active.
              if sandbox.State == runtimeapi.PodSandboxState_SANDBOX_READY {
                  sandboxInfo.active = true
              }

              // Set sandboxes that still have containers to be active.
              if sandboxIDs.Has(sandbox.Id) {
                  sandboxInfo.active = true
              }

              sandboxesByPod[podUID] = append(sandboxesByPod[podUID], sandboxInfo)
          }

          for podUID, sandboxes := range sandboxesByPod {
              if cgc.podStateProvider.ShouldPodContentBeRemoved(podUID) || (evictNonDeletedPods && cgc.podStateProvider.ShouldPodRuntimeBeRemoved(podUID)) {
                  // Remove all evictable sandboxes if the pod has been removed.
                  // Note that the latest dead sandbox is also removed if there is
                  // already an active one.
                  cgc.removeOldestNSandboxes(sandboxes, len(sandboxes))
              } else {
                  // Keep latest one if the pod still exists.
                  cgc.removeOldestNSandboxes(sandboxes, len(sandboxes)-1)
              }
          }
          return nil
      }
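The per-Pod decision can be sketched as follows, based on the three requirements in the comment above. The sandbox type and the functions evictableSandboxes and demoSandboxes are hypothetical names for this example; the sketch returns the IDs it would remove rather than calling the runtime.

```go
package main

import (
	"fmt"
	"sort"
)

// sandbox is a hypothetical, simplified sandbox record for one Pod.
type sandbox struct {
	id      string
	created int64 // creation time as a Unix timestamp
	active  bool  // ready, or still holding containers
}

// evictableSandboxes returns the IDs to remove for one Pod, oldest first.
// Active sandboxes are never removed. If the Pod still exists, the most
// recently created sandbox is kept as well.
func evictableSandboxes(sandboxes []sandbox, podRemoved bool) []string {
	// Newest first, so index 0 is the most recently created sandbox.
	sort.Slice(sandboxes, func(i, j int) bool { return sandboxes[i].created > sandboxes[j].created })
	keep := 1
	if podRemoved {
		keep = 0
	}
	var ids []string
	for i := len(sandboxes) - 1; i >= keep; i-- {
		if !sandboxes[i].active {
			ids = append(ids, sandboxes[i].id)
		}
	}
	return ids
}

// demoSandboxes returns a fresh data set: one old active sandbox and three dead ones.
func demoSandboxes() []sandbox {
	return []sandbox{
		{"s0", 5, true},
		{"s1", 10, false},
		{"s2", 20, false},
		{"s3", 30, false},
	}
}

func main() {
	fmt.Println(evictableSandboxes(demoSandboxes(), false)) // [s1 s2]
	fmt.Println(evictableSandboxes(demoSandboxes(), true))  // [s1 s2 s3]
}
```

While the Pod exists, the newest sandbox (s3) survives even though it is not active; once the Pod is removed, only the active sandbox (s0) is left alone.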
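  3. Remove stale log directories (evictPodLogsDirectories)

    The last step of container garbage collection removes stale logs: Pod log directories and container log symlinks.

    If all kubelet config sources are ready, evictPodLogsDirectories walks every Pod log directory under the log root podLogsRootDirectory, identifies the Pods whose content should be removed, and deletes their log directories. For the legacy container log directory legacyContainerLogsDir, it deletes the symlinks whose target file no longer exists and whose container has exited.

    • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/kubelet/kuberuntime/kuberuntime_gc.go#L329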

      // evictPodLogsDirectories evicts all evictable pod logs directories. Pod logs directories
      // are evictable if there are no corresponding pods.
      func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error {
          osInterface := cgc.manager.osInterface
          if allSourcesReady {
              // Only remove pod logs directories when all sources are ready.
              dirs, err := osInterface.ReadDir(podLogsRootDirectory)
              if err != nil {
                  return fmt.Errorf("failed to read podLogsRootDirectory %q: %v", podLogsRootDirectory, err)
              }
              for _, dir := range dirs {
                  name := dir.Name()
                  podUID := parsePodUIDFromLogsDirectory(name)
                  if !cgc.podStateProvider.ShouldPodContentBeRemoved(podUID) {
                      continue
                  }
                  klog.V(4).InfoS("Removing pod logs", "podUID", podUID)
                  err := osInterface.RemoveAll(filepath.Join(podLogsRootDirectory, name))
                  if err != nil {
                      klog.ErrorS(err, "Failed to remove pod logs directory", "path", name)
                  }
              }
          }

          // Remove dead container log symlinks.
          // TODO(random-liu): Remove this after cluster logging supports CRI container log path.
          logSymlinks, _ := osInterface.Glob(filepath.Join(legacyContainerLogsDir, fmt.Sprintf("*.%s", legacyLogSuffix)))
          for _, logSymlink := range logSymlinks {
              if _, err := osInterface.Stat(logSymlink); os.IsNotExist(err) {
                  if containerID, err := getContainerIDFromLegacyLogSymlink(logSymlink); err == nil {
                      resp, err := cgc.manager.runtimeService.ContainerStatus(containerID, false)
                      if err != nil {
                          // TODO: we should handle container not found (i.e. container was deleted) case differently
                          // once https://github.com/kubernetes/kubernetes/issues/63336 is resolved
                          klog.InfoS("Error getting ContainerStatus for containerID", "containerID", containerID, "err", err)
                      } else {
                          status := resp.GetStatus()
                          if status == nil {
                              klog.V(4).InfoS("Container status is nil")
                              continue
                          }
                          if status.State != runtimeapi.ContainerState_CONTAINER_EXITED {
                              // Here is how container log rotation works (see containerLogManager#rotateLatestLog):
                              //
                              // 1. rename current log to rotated log file whose filename contains current timestamp (fmt.Sprintf("%s.%s", log, timestamp))
                              // 2. reopen the container log
                              // 3. if #2 fails, rename rotated log file back to container log
                              //
                              // There is small but indeterministic amount of time during which log file doesn't exist (between steps #1 and #2, between #1 and #3).
                              // Hence the symlink may be deemed unhealthy during that period.
                              // See https://github.com/kubernetes/kubernetes/issues/52172
                              //
                              // We only remove unhealthy symlink for dead containers
                              klog.V(5).InfoS("Container is still running, not removing symlink", "containerID", containerID, "path", logSymlink)
                              continue
                          }
                      }
                  } else {
                      klog.V(4).InfoS("Unable to obtain container ID", "err", err)
                  }
                  err := osInterface.Remove(logSymlink)
                  if err != nil {
                      klog.ErrorS(err, "Failed to remove container log dead symlink", "path", logSymlink)
                  } else {
                      klog.V(4).InfoS("Removed symlink", "path", logSymlink)
                  }
              }
          }
          return nil
      }
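The first half of this function, matching log directories to Pods, can be sketched on plain strings. The sketch assumes that Pod log directory names end with the Pod UID after the last underscore (for example namespace_name_uid); podUIDFromLogDir and staleLogDirs are hypothetical helpers, and the removable set stands in for the Pod state provider's answer.

```go
package main

import (
	"fmt"
	"strings"
)

// podUIDFromLogDir extracts the Pod UID from a log directory name.
// Assumption: the UID is the part after the last underscore.
func podUIDFromLogDir(name string) string {
	parts := strings.Split(name, "_")
	return parts[len(parts)-1]
}

// staleLogDirs returns the directories whose Pod UID is marked as removable.
func staleLogDirs(dirs []string, removable map[string]bool) []string {
	var stale []string
	for _, d := range dirs {
		if removable[podUIDFromLogDir(d)] {
			stale = append(stale, d)
		}
	}
	return stale
}

func main() {
	dirs := []string{"default_web_uid-1", "kube-system_dns_uid-2"}
	// Only the Pod with UID uid-1 is gone, so only its directory is stale.
	fmt.Println(staleLogDirs(dirs, map[string]bool{"uid-1": true})) // [default_web_uid-1]
}
```

The real function deletes each stale directory with RemoveAll and only logs a failure, so one undeletable directory does not abort the rest of the pass.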