K8s Core Resource Object - Pod (Status and In-place Update)

Based on Kubernetes 1.25.

The PodStatus of a Pod

// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system.
type PodStatus struct {
    // +optional
    // Lifecycle phase of the pod: Pending, Running, Succeeded, Failed, or Unknown
    Phase PodPhase
    // +optional
    // Status conditions of the pod, including ContainersReady, PodInitialized, and PodReady
    Conditions []PodCondition
    // A human readable message indicating details about why the pod is in this state.
    // +optional
    // Human-readable details about why the pod is in its current state
    Message string
    // A brief CamelCase message indicating details about why the pod is in this state. e.g. 'Evicted'
    // +optional
    // Brief reason for the pod's current state
    Reason string
    // nominatedNodeName is set when this pod preempts other pods on the node, but it cannot be
    // scheduled right away as preemption victims receive their graceful termination periods.
    // This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide
    // to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to
    // give the resources on this node to a higher priority pod that is created after preemption.
    // +optional
    // A pod that triggered preemption may have to wait before it can be scheduled; the node it is
    // expected to land on is recorded here as the "nominated node". A non-empty value does not
    // guarantee that the pod will run on that node
    NominatedNodeName string
    // +optional
    // IP of the host the pod is assigned to; empty if the pod has not been scheduled
    HostIP string

    // PodIPs holds all of the known IP addresses allocated to the pod. Pods may be assigned AT MOST
    // one value for each of IPv4 and IPv6.
    // +optional
    // All known IP addresses allocated to the pod; if set, entry 0 must match the PodIP field.
    // A pod may hold at most one value per IP family (IPv4 and IPv6); the list is empty if no
    // IP address has been allocated yet
    PodIPs []PodIP

    // Date and time at which the object was acknowledged by the Kubelet.
    // This is before the Kubelet pulled the container image(s) for the pod.
    // +optional
    // Time at which the kubelet acknowledged the pod
    StartTime *metav1.Time
    // +optional
    // Quality-of-service class of the pod
    QOSClass PodQOSClass

    // The list has one entry per init container in the manifest. The most recent successful
    // init container will have ready = true, the most recently started container will have
    // startTime set.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-and-container-status
    // One entry per init container; the most recent successful init container has ready=true
    InitContainerStatuses []ContainerStatus
    // The list has one entry per app container in the manifest.
    // +optional
    // One entry per app container
    ContainerStatuses []ContainerStatus

    // Status for any ephemeral containers that have run in this pod.
    // +optional
    // One entry per ephemeral container that has run in this pod
    EphemeralContainerStatuses []ContainerStatus
}
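
To make these fields concrete, here is a minimal client-go sketch that fetches a pod and prints the status fields annotated above. The kubeconfig path, namespace ("default"), and pod name ("example-pod") are placeholders.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the default ~/.kube/config and build a clientset.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    pod, err := clientset.CoreV1().Pods("default").Get(context.TODO(), "example-pod", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    // The fields annotated in the PodStatus struct above.
    fmt.Println("Phase:   ", pod.Status.Phase)
    fmt.Println("HostIP:  ", pod.Status.HostIP)
    fmt.Println("PodIPs:  ", pod.Status.PodIPs)
    fmt.Println("QOSClass:", pod.Status.QOSClass)
    for _, c := range pod.Status.Conditions {
        fmt.Printf("Condition %s=%s (%s)\n", c.Type, c.Status, c.Reason)
    }
}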

Pod Phases

  • Pending: the pod has been accepted by Kubernetes, but one or more of its containers has not yet been created
  • Running: the pod has been bound to a node and all of its containers have been created; at least one container is running, or is in the process of starting or restarting
  • Succeeded: all containers in the pod have terminated successfully and will not be restarted
  • Failed: all containers in the pod have terminated, and at least one of them failed (exited with a non-zero code or was terminated by the system)
  • Unknown: the pod's state cannot be determined, typically because the node hosting it is unreachable

Container States

Kubernetes tracks the state of each container in a pod (a short sketch of how these states appear in ContainerStatus follows the list below):

  • Waiting: the container is still completing the operations it needs before it can start
    • For example, the image may still be pulling; the Reason field shows why the container is in this state
  • Running: the container is executing normally
    • If a postStart hook is configured, you can observe that it has already run
  • Terminated: execution has finished
    • If a preStop hook is configured, it runs before the container enters Terminated
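
A small sketch of how these states surface in the API: each entry of pod.Status.ContainerStatuses carries a State in which exactly one of Waiting, Running, or Terminated is non-nil.

package sketch

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// printContainerStates reports which of the three states each container is in.
func printContainerStates(pod *v1.Pod) {
    for _, cs := range pod.Status.ContainerStatuses {
        switch {
        case cs.State.Waiting != nil:
            // Still preparing to start; Reason carries values such as
            // "ContainerCreating" or "ImagePullBackOff".
            fmt.Printf("%s waiting: %s\n", cs.Name, cs.State.Waiting.Reason)
        case cs.State.Running != nil:
            // Running since StartedAt; any postStart hook has already run.
            fmt.Printf("%s running since %s\n", cs.Name, cs.State.Running.StartedAt)
        case cs.State.Terminated != nil:
            // Finished; ExitCode distinguishes success (0) from failure.
            fmt.Printf("%s terminated: exit=%d reason=%s\n",
                cs.Name, cs.State.Terminated.ExitCode, cs.State.Terminated.Reason)
        }
    }
}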

Pod Conditions

PodCondition represents one aspect of the pod's state (a lookup helper is sketched after the constants below):

  • ContainersReady: all containers in the pod are ready and can start serving

  • PodInitialized: initialization has completed, i.e. all init containers in the pod have started successfully

  • PodReady: the pod is ready and can receive traffic

  • PodScheduled: the pod has been bound to a node by the scheduler

  • AlphaNoCompatGuaranteeDisruptionTarget: the pod is about to be terminated due to a disruption, such as preemption, the eviction API, or garbage collection (for example when the host needs maintenance)

  • Ref: https://github.com/kubernetes/kubernetes/blob/88e994f6bf8fc88114c5b733e09afea339bea66d/pkg/apis/core/types.go#L2421

    // These are valid conditions of pod.
    const (
        // PodScheduled represents status of the scheduling process for this pod.
        PodScheduled PodConditionType = "PodScheduled"
        // PodReady means the pod is able to service requests and should be added to the
        // load balancing pools of all matching services.
        PodReady PodConditionType = "Ready"
        // PodInitialized means that all init containers in the pod have started successfully.
        PodInitialized PodConditionType = "Initialized"
        // PodReasonUnschedulable reason in PodScheduled PodCondition means that the scheduler
        // can't schedule the pod right now, for example due to insufficient resources in the cluster.
        PodReasonUnschedulable = "Unschedulable"
        // ContainersReady indicates whether all containers in the pod are ready.
        ContainersReady PodConditionType = "ContainersReady"
        // AlphaNoCompatGuaranteeDisruptionTarget indicates the pod is about to be deleted due to a
        // disruption (such as preemption, eviction API or garbage-collection).
        // The constant is to be renamed once the name is accepted within the KEP-3329.
        AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
    )
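
The helper below sketches how a consumer reads one of these conditions out of PodStatus. The Kubernetes tree ships an equivalent helper (GetPodCondition in pkg/api/v1/pod); this version is only for illustration.

package sketch

import v1 "k8s.io/api/core/v1"

// getPodCondition returns the condition of the given type, or nil if absent.
func getPodCondition(status *v1.PodStatus, condType v1.PodConditionType) *v1.PodCondition {
    for i := range status.Conditions {
        if status.Conditions[i].Type == condType {
            return &status.Conditions[i]
        }
    }
    return nil
}

// isPodReady mirrors the common readiness check: the PodReady condition
// must be present and set to ConditionTrue.
func isPodReady(pod *v1.Pod) bool {
    c := getPodCondition(&pod.Status, v1.PodReady)
    return c != nil && c.Status == v1.ConditionTrue
}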

Generating the Pod Status

// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
    klog.V(3).InfoS("Generating pod status", "pod", klog.KObj(pod))
    // use the previous pod status, or the api status, as the basis for this pod
    // Start from the pod's previous status
    oldPodStatus, found := kl.statusManager.GetPodStatus(pod.UID)
    if !found {
        oldPodStatus = pod.Status
    }
    // Build an API PodStatus from the internal pod status and the pod's previous API status
    s := kl.convertStatusToAPIStatus(pod, podStatus, oldPodStatus)
    // calculate the next phase and preserve reason
    // Compute the phase
    allStatus := append(append([]v1.ContainerStatus{}, s.ContainerStatuses...), s.InitContainerStatuses...)
    s.Phase = getPhase(&pod.Spec, allStatus)
    klog.V(4).InfoS("Got phase for pod", "pod", klog.KObj(pod), "oldPhase", oldPodStatus.Phase, "phase", s.Phase)

    // Perform a three-way merge between the statuses from the status manager,
    // runtime, and generated status to ensure terminal status is correctly set.
    if s.Phase != v1.PodFailed && s.Phase != v1.PodSucceeded {
        switch {
        case oldPodStatus.Phase == v1.PodFailed || oldPodStatus.Phase == v1.PodSucceeded:
            klog.V(4).InfoS("Status manager phase was terminal, updating phase to match", "pod", klog.KObj(pod), "phase", oldPodStatus.Phase)
            s.Phase = oldPodStatus.Phase
        case pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded:
            klog.V(4).InfoS("API phase was terminal, updating phase to match", "pod", klog.KObj(pod), "phase", pod.Status.Phase)
            s.Phase = pod.Status.Phase
        }
    }

    if s.Phase == oldPodStatus.Phase {
        // preserve the reason and message which is associated with the phase
        s.Reason = oldPodStatus.Reason
        s.Message = oldPodStatus.Message
        if len(s.Reason) == 0 {
            s.Reason = pod.Status.Reason
        }
        if len(s.Message) == 0 {
            s.Message = pod.Status.Message
        }
    }

    // check if an internal module has requested the pod is evicted and override the reason and message
    // Run the registered pod sync handlers in order to decide whether the pod may stay on this node;
    // if any handler votes to evict, the pod's phase becomes PodFailed and the pod is eventually
    // evicted from the node
    for _, podSyncHandler := range kl.PodSyncHandlers {
        if result := podSyncHandler.ShouldEvict(pod); result.Evict {
            s.Phase = v1.PodFailed
            s.Reason = result.Reason
            s.Message = result.Message
            break
        }
    }

    // pods are not allowed to transition out of terminal phases
    if pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded {
        // API server shows terminal phase; transitions are not allowed
        if s.Phase != pod.Status.Phase {
            klog.ErrorS(nil, "Pod attempted illegal phase transition", "pod", klog.KObj(pod), "originalStatusPhase", pod.Status.Phase, "apiStatusPhase", s.Phase, "apiStatus", s)
            // Force back to phase from the API server
            s.Phase = pod.Status.Phase
        }
    }

    // ensure the probe managers have up to date status for containers
    kl.probeManager.UpdatePodStatus(pod.UID, s)

    // preserve all conditions not owned by the kubelet
    s.Conditions = make([]v1.PodCondition, 0, len(pod.Status.Conditions)+1)
    for _, c := range pod.Status.Conditions {
        if !kubetypes.PodConditionByKubelet(c.Type) {
            s.Conditions = append(s.Conditions, c)
        }
    }
    // set all Kubelet-owned conditions
    if utilfeature.DefaultFeatureGate.Enabled(features.PodHasNetworkCondition) {
        s.Conditions = append(s.Conditions, status.GeneratePodHasNetworkCondition(pod, podStatus))
    }
    s.Conditions = append(s.Conditions, status.GeneratePodInitializedCondition(&pod.Spec, s.InitContainerStatuses, s.Phase))
    s.Conditions = append(s.Conditions, status.GeneratePodReadyCondition(&pod.Spec, s.Conditions, s.ContainerStatuses, s.Phase))
    s.Conditions = append(s.Conditions, status.GenerateContainersReadyCondition(&pod.Spec, s.ContainerStatuses, s.Phase))
    s.Conditions = append(s.Conditions, v1.PodCondition{
        Type:   v1.PodScheduled,
        Status: v1.ConditionTrue,
    })
    // set HostIP and initialize PodIP/PodIPs for host network pods
    if kl.kubeClient != nil {
        hostIPs, err := kl.getHostIPsAnyWay()
        if err != nil {
            klog.V(4).InfoS("Cannot get host IPs", "err", err)
        } else {
            s.HostIP = hostIPs[0].String()
            // HostNetwork Pods inherit the node IPs as PodIPs. They are immutable once set,
            // other than that if the node becomes dual-stack, we add the secondary IP.
            if kubecontainer.IsHostNetworkPod(pod) {
                // Primary IP is not set
                if s.PodIP == "" {
                    s.PodIP = hostIPs[0].String()
                    s.PodIPs = []v1.PodIP{{IP: s.PodIP}}
                }
                // Secondary IP is not set #105320
                if len(hostIPs) == 2 && len(s.PodIPs) == 1 {
                    s.PodIPs = append(s.PodIPs, v1.PodIP{IP: hostIPs[1].String()})
                }
            }
        }
    }

    return *s
}
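
getPhase, called near the top of generateAPIPodStatus, is not shown above. The sketch below is a heavily simplified approximation of how the rules from the "Pod Phases" list map onto container states; the real getPhase in pkg/kubelet/kubelet_pods.go also handles init containers, ephemeral containers, and unknown container states.

package sketch

import v1 "k8s.io/api/core/v1"

// phaseSketch derives a pod phase from its app container statuses alone.
func phaseSketch(spec *v1.PodSpec, statuses []v1.ContainerStatus) v1.PodPhase {
    running, succeeded, stopped, waiting := 0, 0, 0, 0
    for _, s := range statuses {
        switch {
        case s.State.Running != nil:
            running++
        case s.State.Terminated != nil:
            stopped++
            if s.State.Terminated.ExitCode == 0 {
                succeeded++
            }
        default:
            waiting++
        }
    }
    switch {
    case running > 0:
        // At least one container is running (or restarting): Running.
        return v1.PodRunning
    case stopped > 0 && waiting == 0:
        // Every container has terminated.
        if spec.RestartPolicy == v1.RestartPolicyAlways {
            return v1.PodRunning // stopped containers will be restarted
        }
        if stopped == succeeded {
            return v1.PodSucceeded // all containers exited with code 0
        }
        if spec.RestartPolicy == v1.RestartPolicyNever {
            return v1.PodFailed // at least one failure, never retried
        }
        return v1.PodRunning // OnFailure: failed containers restart
    default:
        // Containers are still being created or waiting to start.
        return v1.PodPending
    }
}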

In-place Update

Why In-place Updates Are Needed

  • The pod's workload has grown significantly and its allocated resources are no longer sufficient
  • The workload is light and the allocated resources are underutilized
  • The resource allocation was unreasonable to begin with

OpenKruise

The open-source OpenKruise project supports in-place upgrades for its extended workloads Advanced StatefulSet, CloneSet, and SidecarSet. To enable them, set the updateStrategy type to one of the following two values:

  • InPlaceIfPossible: the controller attempts to update the pod in place where possible, instead of recreating it
    • Currently only spec.template.spec.containers[x].image supports in-place update
  • InPlaceOnly: in-place update only; no field outside spec.template.spec.containers[x].image may be modified

updateStrategy defaults to ReCreate, so in-place upgrades must be enabled explicitly, as in the sketch below.
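
As an illustration, this sketch sets that strategy through OpenKruise's Go API. The identifiers are assumed from the kruise-api module (github.com/openkruise/kruise-api) and may differ between versions; in YAML this is simply spec.updateStrategy.type: InPlaceIfPossible.

package sketch

import (
    kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
)

// inPlaceStrategy opts a CloneSet into in-place updates when possible.
func inPlaceStrategy() kruiseappsv1alpha1.CloneSetUpdateStrategy {
    return kruiseappsv1alpha1.CloneSetUpdateStrategy{
        // The default type is ReCreate; in-place behavior is opt-in.
        Type: kruiseappsv1alpha1.InPlaceIfPossibleCloneSetUpdateStrategyType,
    }
}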

API Changes

  • Pod.Spec.Containers[i].Resources: becomes a pure declaration of the pod's desired resources
  • Pod.Status.ContainerStatuses[i].ResourcesAllocated: (new field, v1.ResourceList) describes the node resources allocated to the pod and its containers
  • Pod.Status.ContainerStatuses[i].Resources: (new field, v1.ResourceRequirements) describes the actual resources held by the pod and its containers
  • Pod.Status.Resize: (new field, map[string]string) reports the resize status of a given resource for a given container

Container Resize Policy

ResizePolicy supports the following policies:

  • RestartNotRequired: the default; resize without restarting the container if possible
    • This does not guarantee that the container will not be restarted
    • The resize may still cause the container to be terminated
  • Restart: the container must be restarted for the new resources to be applied

Resize Status

Pod.Status.Resize[] is a newly added field that reports whether the kubelet has accepted or rejected a proposed resize for a given resource.

  • When the Pod.Spec.Containers[i].Resources.Requests field differs from the Pod.Status.ContainerStatuses[i].Resources field, this field explains why

The field can take the following values:

  • Proposed: the resize has been proposed and not yet accepted or rejected
  • InProgress: the resize has been accepted and is being applied
  • Deferred: the resize is theoretically feasible but cannot be done right now; it will be re-evaluated
  • Infeasible: the resize is not feasible and has been rejected
  • (no value): no resize is proposed

CRI Changes

The kubelet calls UpdateContainerResources to adjust container resource quotas. The call currently takes runtimeapi.LinuxContainerResources, which works for Docker and Kata but not for Windows.
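
Below is a minimal sketch of that CRI call using the gRPC types from k8s.io/cri-api; the container ID and the resource values are placeholders.

package sketch

import (
    "context"

    runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// resizeContainer asks the runtime to apply new resource limits to a running
// container; the values here stand for 0.5 CPU and 256Mi of memory.
func resizeContainer(ctx context.Context, client runtimeapi.RuntimeServiceClient, containerID string) error {
    _, err := client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
        ContainerId: containerID,
        // Only Linux resources are honored today, as noted above.
        Linux: &runtimeapi.LinuxContainerResources{
            CpuPeriod:          100000,            // 100ms CFS period
            CpuQuota:           50000,             // half a CPU per period
            MemoryLimitInBytes: 256 * 1024 * 1024, // 256Mi
        },
    })
    return err
}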