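Scheduler Deep Dive

Overview

The Kubernetes scheduler (kube-scheduler) selects the most suitable Node for each newly created Pod and is one of the core control-plane components. This document examines how the scheduler works, its scheduling algorithms, and its source-code implementation.

Core Architecture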

┌──────────────────────────────────────────────────────────────────────────────┐
│                          kube-scheduler                                      │
│                          cmd/kube-scheduler/                                 │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                      SchedulerCache (scheduler cache)                        │
│                      pkg/scheduler/internal/cache/                           │
│                                                                              │
│  - Local cache of Node/Pod state from the API server                         │
│  - Kept in sync through ListWatch                                            │
│  - Supports snapshots, reducing load on the API server                       │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                      SchedulingQueue (scheduling queue)                      │
│              pkg/scheduler/internal/queue/scheduling_queue.go                │
│                                                                              │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │ ActiveQ          │  │ UnschedulableQ   │  │ BackoffQ         │            │
│  │ ready to schedule│  │ unschedulable    │  │ backing off      │            │
│  │ (priority order) │  │ (awaits capacity)│  │ (retry backoff)  │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                       Framework (scheduling framework)                       │
│                       pkg/scheduler/framework/                               │
│                                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Filter     │  │  PreScore   │  │  Score      │  │  Reserve    │          │
│  │  filtering  │  │  pre-scoring│  │  scoring    │  │  reservation│          │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Permit     │  │  Bind       │  │  PostBind   │  │  Unreserve  │          │
│  │  approval   │  │  binding    │  │  post-bind  │  │ undo reserve│          │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘          │
└──────────────────────────────────────────────────────────────────────────────┘

Scheduling Cycle

Main Loop

File: pkg/scheduler/scheduler.go (pkg/scheduler/schedule_one.go in recent releases)
Function: scheduleOne() (roughly 150 lines)

┌──────────────────────────────────────────────────────────────────────────────┐
│                      Fetch the next Pod from the queue                       │
│                      schedulingQueue.Pop()                                   │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Build the scheduling context                          │
│                        schedulingCycle()                                     │
│                                                                              │
│  1. Basic checks                                                             │
│     ├── Has the Pod been deleted?                                            │
│     ├── Has the scheduling cycle timed out?                                  │
│     └── Should this Pod be skipped?                                          │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Run the scheduling framework                          │
│                    schedule() -> Framework.RunFilterPlugins()                │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ 1. Filter stage (find feasible nodes)                                  │  │
│  │    - NodeResourcesFit: resource requests fit                           │  │
│  │    - HostName: host name matches                                       │  │
│  │    - PodMatchNodeSelector: node selector matches                       │  │
│  │    - NoVolumeZoneConflict: volume zone matches                         │  │
│  │    - VolumeBinding: PVCs can be bound                                  │  │
│  │    - MaxEBSVolumeCount: EBS volume count limit                         │  │
│  │    - MaxGCEPDVolumeCount: GCE PD volume count limit                    │  │
│  │    - MaxAzureDiskVolumeCount: Azure Disk count limit                   │  │
│  │    - VolumeNodeAffinity: volume node affinity                          │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ 2. PreScore stage                                                      │  │
│  │    - Run pre-scoring plugins                                           │  │
│  │    - Compute data needed for scoring                                   │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ 3. Score stage (score feasible nodes)                                  │  │
│  │    - NodeResourcesBalancedAllocation: balanced usage                   │  │
│  │    - ImageLocality: image locality                                     │  │
│  │    - InterPodAffinity: Pod affinity/anti-affinity                      │  │
│  │    - NodeAffinity: node affinity                                       │  │
│  │    - NodePreferAvoidPods: avoid annotated nodes                        │  │
│  │    - TaintToleration: taints and tolerations                           │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Select the best node                                  │
│                        selectHost()                                          │
│                                                                              │
│  1. Aggregate all Score results (weighted sum)                               │
│  2. Pick the node with the highest score                                     │
│  3. Break ties at random (equal scores)                                      │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Binding cycle                                         │
│                    bind() -> Framework.RunBindPlugins()                      │
│                                                                              │
│  1. Reserve plugins (run at the end of the scheduling cycle)                 │
│     - VolumeBinding: reserve PVs (binding happens in PreBind)                │
│                                                                              │
│  2. Bind plugins                                                             │
│     - DefaultBinder: create Binding (sets Pod.Spec.NodeName)                 │
│                                                                              │
│  3. PostBind plugins                                                         │
│     - Run after a successful bind (e.g. cleanup)                             │
└──────────────────────────────────────────────────────────────────────────────┘
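
The node-selection step can be illustrated with a minimal, self-contained sketch. It assumes ties are broken uniformly at random with reservoir sampling in a single pass over the scores; the `NodeScore` type, the `selectHost` signature, and the injected `pick` function are illustrative stand-ins rather than the scheduler's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// NodeScore pairs a node name with its aggregated score (illustrative type).
type NodeScore struct {
	Name  string
	Score int64
}

// selectHost returns the name of a highest-scoring node. Ties are broken with
// reservoir sampling: the k-th node sharing the best score replaces the
// current pick with probability 1/k. pick(n) must return a value in [0, n).
func selectHost(scores []NodeScore, pick func(n int) int) (string, error) {
	if len(scores) == 0 {
		return "", errors.New("empty node score list")
	}
	best := scores[0]
	ties := 1 // nodes seen so far that share the best score
	for _, ns := range scores[1:] {
		switch {
		case ns.Score > best.Score:
			best, ties = ns, 1
		case ns.Score == best.Score:
			ties++
			if pick(ties) == 0 {
				best = ns
			}
		}
	}
	return best.Name, nil
}

func main() {
	scores := []NodeScore{{"node-a", 70}, {"node-b", 95}, {"node-c", 95}}
	host, err := selectHost(scores, rand.Intn)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected:", host) // node-b or node-c, chosen at random
}
```

Injecting the random source as a function keeps the tie-break deterministic under test while `rand.Intn` can be used in normal operation.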

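Scheduling Algorithms in Detail

1. Filter Stage (Predicates)

The Filter stage eliminates nodes that cannot run the Pod. Evaluation short-circuits per node: the first plugin that rejects a node ends the checks for that node.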

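// Simplified sketch of the filtering loop (upstream: findNodesThatPassFilters;
// its file location varies by release, and the real code checks nodes in parallel).
func findNodesThatPassFilters(ctx context.Context, state *framework.CycleState, pod *v1.Pod,
    nodes []*framework.NodeInfo, filterPlugins []framework.FilterPlugin) []*v1.Node {
    var feasibleNodes []*v1.Node
    for _, nodeInfo := range nodes {
        passed := true
        for _, p := range filterPlugins {
            status := p.Filter(ctx, state, pod, nodeInfo)
            if !status.IsSuccess() {
                passed = false
                break // short-circuit: skip the remaining plugins for this node
            }
        }
        if passed {
            feasibleNodes = append(feasibleNodes, nodeInfo.Node())
        }
    }
    return feasibleNodes
}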

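Common Filter rules (listed by their legacy predicate names):

Rule	Description
PodFitsResources	Node has enough free CPU and memory
PodFitsHost	Node name matches the host requested by the Pod
PodFitsHostPorts	Requested host ports are free on the node
MatchNodeSelector	Node labels match the Pod's node selector
NoVolumeZoneConflict	Volume zone matches the node's zone
PodToleratesNodeTaints	Pod tolerates the node's taints
CheckNodeMemoryPressure	Excludes nodes under memory pressure
CheckNodeDiskPressure	Excludes nodes under disk pressure
CheckNodePIDPressure	Excludes nodes under PID pressure
CheckVolumeBinding	The Pod's PVCs can be bound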

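2. Score Stage (Priorities)

The Score stage rates every feasible node. Each plugin's score is normalized to the range [0, 100], and the plugin scores are then combined as a weighted sum (not an average), so a node's total can exceed 100:

// Score calculation
score = sum(pluginScore * pluginWeight)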

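Built-in scoring plugins:

Plugin	Weight	Description
NodeResourcesBalancedAllocation	1	Favors balanced CPU/memory utilization
ImageLocality	1	Favors nodes that already hold the Pod's images
InterPodAffinity	1	Pod affinity/anti-affinity
NodeAffinity	1	Node affinity
NodePreferAvoidPods	10000	Steers Pods away from nodes marked to be avoided
TaintToleration	1	Taints and tolerations
MostRequested	1	Favors nodes with the most requested resources (not enabled by default)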
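
As a worked example of combining plugin scores, the sketch below multiplies each plugin's normalized score by its weight and sums the results for one node. The `PluginScore` type and `totalScore` function are hypothetical names for illustration, and the sample scores are made up; only the weights come from the table above.

```go
package main

import "fmt"

// PluginScore is one plugin's normalized score (0-100) for a node together
// with the plugin's configured weight (illustrative type).
type PluginScore struct {
	Plugin string
	Score  int64
	Weight int64
}

// totalScore returns the weighted sum of the plugin scores for one node.
func totalScore(scores []PluginScore) int64 {
	var total int64
	for _, s := range scores {
		total += s.Score * s.Weight
	}
	return total
}

func main() {
	nodeA := []PluginScore{
		{"NodeResourcesBalancedAllocation", 80, 1},
		{"ImageLocality", 50, 1},
		{"TaintToleration", 100, 1},
	}
	fmt.Println(totalScore(nodeA)) // 230
}
```

Because the result is a sum, a plugin with a very large weight such as NodePreferAvoidPods (10000) dominates every other plugin's contribution.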

3. Balanced Resource Allocation

// pkg/scheduler/framework/plugins/noderesources/balanced_allocation.go
// Simplified sketch, not the upstream code; the helper functions are illustrative.

func Score(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
    fractions := []float64{}

    for _, resource := range []v1.ResourceName{v1.ResourceCPU, v1.ResourceMemory} {
        // Utilization if the Pod were placed here:
        // (requests already on the node + this Pod's request) / allocatable
        requested := requestedOnNode(resource, nodeInfo) + resourceRequest(resource, pod)
        allocatable := allocatableOnNode(resource, nodeInfo)
        fractions = append(fractions, float64(requested)/float64(allocatable))
    }

    // Imbalance between the fractions (closer to 0 means more balanced)
    imbalance := calculateImbalance(fractions)

    // Map to [0, MaxNodeScore] (0-100): lower imbalance gives a higher score
    return int64((1 - imbalance) * float64(framework.MaxNodeScore)), nil
}
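
The balance calculation can be made concrete with a small runnable sketch. It assumes the imbalance is measured as the standard deviation of the per-resource utilization fractions and that the score is `(1 - stddev) * 100`, truncated to an integer; `balancedScore` is a hypothetical helper, not a function from the Kubernetes code base.

```go
package main

import (
	"fmt"
	"math"
)

const maxNodeScore = 100 // mirrors framework.MaxNodeScore

// balancedScore maps per-resource utilization fractions (each in [0, 1]) to a
// score: the smaller the standard deviation between them, the higher the score.
func balancedScore(fractions []float64) int64 {
	if len(fractions) == 0 {
		return 0
	}
	var mean float64
	for _, f := range fractions {
		mean += f
	}
	mean /= float64(len(fractions))

	var variance float64
	for _, f := range fractions {
		variance += (f - mean) * (f - mean)
	}
	std := math.Sqrt(variance / float64(len(fractions)))

	return int64((1 - std) * maxNodeScore)
}

func main() {
	// Same CPU and memory utilization: perfectly balanced.
	fmt.Println(balancedScore([]float64{0.5, 0.5}))
	// CPU at 90%, memory at 10%: heavily skewed, so the score drops.
	fmt.Println(balancedScore([]float64{0.9, 0.1}))
}
```

A node where CPU and memory would be equally utilized scores 100 regardless of how full it is, which is why this plugin is normally combined with a utilization-based scorer.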

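4. Image Locality Scoring

// pkg/scheduler/framework/plugins/imagelocality/image_locality.go
// Simplified sketch, not the upstream code; the helper functions are illustrative.

func Score(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
    // Collect the images the Pod needs
    images := getContainerImages(pod)

    // Sum the sizes of the required images that are already on the node
    matchedImageSize := int64(0)
    for _, image := range images {
        if state, ok := nodeInfo.ImageStates[image]; ok {
            matchedImageSize += state.Size
        }
    }

    // Map the matched size to [0, MaxNodeScore] (0-100): nothing cached scores 0,
    // and the score grows with the size of the images already present
    score := calculateImageLocalityScore(matchedImageSize, len(pod.Spec.Containers))

    return score, nil
}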
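
The size-to-score mapping can be sketched as a clamped linear interpolation. The thresholds below (23 MB minimum, 1000 MB per container maximum) are assumptions modeled on the upstream defaults as best understood here, and `imageLocalityScore` is a hypothetical helper; the upstream plugin additionally scales each image by how widely it is spread across nodes, which this sketch omits.

```go
package main

import "fmt"

const (
	mb                    int64 = 1024 * 1024
	minThreshold                = 23 * mb   // assumed: below this, the score is 0
	maxContainerThreshold       = 1000 * mb // assumed: per-container size that earns the top score
	maxNodeScore          int64 = 100
)

// imageLocalityScore maps the total size of the required images already
// present on a node to a score in [0, 100] by linear interpolation between
// the minimum threshold and the per-container maximum times numContainers.
func imageLocalityScore(matchedSize int64, numContainers int) int64 {
	if numContainers <= 0 {
		return 0
	}
	maxThreshold := maxContainerThreshold * int64(numContainers)
	if matchedSize < minThreshold {
		matchedSize = minThreshold
	} else if matchedSize > maxThreshold {
		matchedSize = maxThreshold
	}
	return maxNodeScore * (matchedSize - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	fmt.Println(imageLocalityScore(0, 1))       // 0: nothing cached
	fmt.Println(imageLocalityScore(500*mb, 1))  // 48
	fmt.Println(imageLocalityScore(2000*mb, 1)) // 100: clamped at the maximum
}
```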

Scheduling Queue

Three Queues

┌──────────────────────────────────────────────────────────────────────────────┐
│                           SchedulingQueue                                    │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ ActiveQ (active queue)                                                 │  │
│  │                                                                        │  │
│  │ - Priority queue ordered by Pod priority                               │  │
│  │ - Higher-priority Pods are scheduled first                             │  │
│  │ - Heap-based: O(log n) enqueue/dequeue                                 │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ UnschedulableQ (unschedulable queue)                                   │  │
│  │                                                                        │  │
│  │ - Pods that failed scheduling                                          │  │
│  │ - Wait for resources to free up or conditions to change                │  │
│  │ - Retried periodically                                                 │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ BackoffQ (backoff queue)                                               │  │
│  │                                                                        │  │
│  │ - Pods that failed scheduling but may succeed soon                     │  │
│  │ - Exponential backoff: 1s, 2s, 4s, 8s, ... (defaults)                  │  │
│  │ - Maximum backoff: 10s (default)                                       │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
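
The backoff schedule can be sketched as a doubling loop with a cap. This is a minimal illustration that assumes the default settings of a 1s initial and a 10s maximum backoff (both configurable through the scheduler configuration); `backoffDuration` and the two constants are illustrative names.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultInitialBackoff = 1 * time.Second  // assumed default initial backoff
	defaultMaxBackoff     = 10 * time.Second // assumed default maximum backoff
)

// backoffDuration returns how long a Pod stays in the backoff queue after
// `attempts` failed scheduling attempts: initial * 2^(attempts-1), capped at max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		// Compare by subtraction so that doubling can never overflow.
		if d > max-d {
			return max
		}
		d += d
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Prints 1s, 2s, 4s, 8s, 10s, 10s for attempts 1 through 6.
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, defaultInitialBackoff, defaultMaxBackoff))
	}
}
```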

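Pod Priority

// Priority lookup: a Pod without an explicit priority is treated as 0
func podPriority(pod *v1.Pod) int32 {
    if pod.Spec.Priority != nil {
        return *pod.Spec.Priority
    }
    return 0
}

// Built-in priority classes
- system-cluster-critical: 2000000000
- system-node-critical: 2000001000
- default (no priority class): 0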
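
How priorities translate into dequeue order can be shown with a heap-based queue built on the standard library's `container/heap`. This is a minimal sketch: `queuedPod`, `activeQ`, and `drain` are illustrative names, and the arrival sequence number stands in for the enqueue timestamp that is assumed to break ties between Pods of equal priority.

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is an illustrative stand-in for the scheduler's queued Pod info.
type queuedPod struct {
	name     string
	priority int32
	seq      int64 // arrival order; stands in for the enqueue timestamp
}

// activeQ implements heap.Interface: higher priority first, then earlier arrival.
type activeQ []queuedPod

func (q activeQ) Len() int { return len(q) }
func (q activeQ) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}
func (q activeQ) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *activeQ) Push(x any)   { *q = append(*q, x.(queuedPod)) }
func (q *activeQ) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// drain enqueues all Pods and returns their names in scheduling order.
func drain(pods []queuedPod) []string {
	q := &activeQ{}
	for _, p := range pods {
		heap.Push(q, p)
	}
	order := make([]string, 0, len(pods))
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(queuedPod).name)
	}
	return order
}

func main() {
	fmt.Println(drain([]queuedPod{
		{"web", 0, 1},
		{"dns", 2000000000, 2}, // system-cluster-critical
		{"batch", 0, 3},
	})) // [dns web batch]
}
```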

Affinity and Anti-Affinity

Inter-Pod Affinity

// pkg/scheduler/framework/plugins/interpodaffinity/filtering.go
// Simplified sketch; getMatchingPods and the podMatches* helpers are illustrative.

func (pl *InterPodAffinity) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    affinity := pod.Spec.Affinity
    if affinity == nil {
        return nil // no affinity constraints: the node passes
    }

    // 1. Hard affinity (requiredDuringSchedulingIgnoredDuringExecution)
    if affinity.PodAffinity != nil {
        for _, term := range affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
            matchingPods := pl.getMatchingPods(term.TopologyKey, nodeInfo.Node())
            if !podMatchesAffinityTerm(pod, matchingPods, term) {
                return framework.NewStatus(framework.Unschedulable, "pod affinity not satisfied")
            }
        }
    }

    // 2. Hard anti-affinity
    if affinity.PodAntiAffinity != nil {
        for _, term := range affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
            matchingPods := pl.getMatchingPods(term.TopologyKey, nodeInfo.Node())
            if podMatchesAntiAffinityTerm(pod, matchingPods, term) {
                return framework.NewStatus(framework.Unschedulable, "pod anti-affinity not satisfied")
            }
        }
    }

    return nil // a nil status means success
}

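Score Calculation

// Soft-affinity scoring (preferredDuringSchedulingIgnoredDuringExecution)
// Simplified sketch; getMatchingPods is an illustrative helper.
var score int64
for _, term := range pod.Spec.Affinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
    // Existing Pods in the node's topology domain that match the term
    matchingPods := pl.getMatchingPods(term.PodAffinityTerm.TopologyKey, nodeInfo.Node())

    // Every matching Pod adds the term's weight; anti-affinity terms subtract it.
    // The raw totals are later normalized to [0, 100] across nodes.
    score += int64(term.Weight) * int64(len(matchingPods))
}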
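
The role of the topology key is easiest to see in a stripped-down model. The sketch below checks a hard anti-affinity rule: a candidate node is rejected when any existing Pod that matches the selector runs in the same topology domain, that is, on a node with the same value for the topology key. The `node` and `placedPod` types and both functions are hypothetical, and selector matching is reduced to exact label equality.

```go
package main

import "fmt"

type node struct {
	name   string
	labels map[string]string
}

type placedPod struct {
	labels map[string]string
	node   *node
}

// matches reports whether labels contain every key/value pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// violatesAntiAffinity reports whether placing a Pod on candidate would put it
// in the same topology domain as an existing Pod that matches selector.
func violatesAntiAffinity(candidate *node, existing []placedPod, selector map[string]string, topologyKey string) bool {
	domain, ok := candidate.labels[topologyKey]
	if !ok {
		return false // the node does not belong to any domain for this key
	}
	for _, p := range existing {
		d, ok := p.node.labels[topologyKey]
		if ok && d == domain && matches(p.labels, selector) {
			return true
		}
	}
	return false
}

func main() {
	const zoneKey = "topology.kubernetes.io/zone"
	n1 := &node{"n1", map[string]string{zoneKey: "a"}}
	n2 := &node{"n2", map[string]string{zoneKey: "a"}}
	n3 := &node{"n3", map[string]string{zoneKey: "b"}}
	existing := []placedPod{{map[string]string{"app": "web"}, n1}}
	selector := map[string]string{"app": "web"}

	fmt.Println(violatesAntiAffinity(n2, existing, selector, zoneKey)) // true: same zone as n1
	fmt.Println(violatesAntiAffinity(n3, existing, selector, zoneKey)) // false: different zone
}
```

With `kubernetes.io/hostname` as the topology key, each node is its own domain, so the same rule spreads Pods across nodes instead of zones.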

Volume Scheduling

PVC Binding Flow

┌──────────────────────────────────────────────────────────────────────────────┐
│                        VolumeBinding plugin                                  │
│                                                                              │
│  1. Is the PVC already bound?                                                │
│     └── Bound -> check the PV's node affinity                                │
│                                                                              │
│  2. Dynamic provisioning                                                     │
│     ├── Check the StorageClass                                               │
│     ├── External provisioner (e.g. CSI) creates the PV                       │
│     └── Wait for the PVC to be bound                                         │
│                                                                              │
│  3. Delayed binding (WaitForFirstConsumer)                                   │
│     ├── Leave the PVC unbound for now                                        │
│     └── Bind once the Pod has been scheduled                                 │
└──────────────────────────────────────────────────────────────────────────────┘

Delayed Binding Mode

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: delayed-binding
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer  # delay binding until a Pod uses the PVC

Scheduling Tuning

1. Scheduler Configuration

# scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesBalancedAllocation
    args:
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

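2. Multiple Schedulers

# Start a custom scheduler under its own name
# (recent releases no longer accept --scheduler-name; set profiles[].schedulerName
#  in the file passed with --config instead)
kube-scheduler --scheduler-name=custom-scheduler

# Select the scheduler in the Pod spec
apiVersion: v1
kind: Pod
spec:
  schedulerName: custom-scheduler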

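3. Extending the Scheduling Framework

// A custom scheduling plugin
type MyFilterPlugin struct{}

// Name is required by the framework.Plugin interface.
func (pl *MyFilterPlugin) Name() string { return "MyFilter" }

func (pl *MyFilterPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Custom filtering logic (shouldSchedule is a user-defined helper)
    if !pl.shouldSchedule(pod, nodeInfo.Node()) {
        return framework.NewStatus(framework.Unschedulable, "custom reason")
    }
    return nil
}

// Register the plugin when building a custom scheduler binary.
// app is k8s.io/kubernetes/cmd/kube-scheduler/app; the factory signature
// differs between releases (newer ones also take a context.Context).
func main() {
    cmd := app.NewSchedulerCommand(
        app.WithPlugin("MyFilter", func(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
            return &MyFilterPlugin{}, nil
        }),
    )
    if err := cmd.Execute(); err != nil {
        os.Exit(1)
    }
}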

Key Code Paths

File	Description
`cmd/kube-scheduler/scheduler.go`	Scheduler entry point
`pkg/scheduler/scheduler.go`	Main scheduling logic
`pkg/scheduler/framework/`	Scheduling framework
`pkg/scheduler/internal/queue/`	Scheduling queue
`pkg/scheduler/framework/plugins/`	Built-in plugins
`pkg/scheduler/core/generic_scheduler.go`	Generic scheduler
`staging/src/k8s.io/client-go/`	Client library

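Troubleshooting

1. Pod Stuck in Pending

# Check the scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>

# Check the Pod's scheduling events
kubectl describe pod <pod-name>

# List the events that explain the scheduling decision
kubectl get events --field-selector involvedObject.name=<pod-name>

2. Pod Fails to Schedule

# Common causes and the messages they typically produce (exact wording varies by release)
# - Insufficient resources: Insufficient cpu / Insufficient memory
# - Untolerated taints: node(s) had taints that the pod didn't tolerate
# - Unsatisfied affinity: node(s) didn't match pod affinity/anti-affinity rules
# - Unavailable volume: node(s) had volume node affinity conflict

3. Scheduling Performance Problems

# Raise log verbosity to trace scheduling decisions
kube-scheduler --v=5

# List Scheduled events to see when Pods were scheduled
kubectl get events --field-selector reason=Scheduled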

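Scheduler Metrics

# Scheduling-related metrics
scheduler_schedule_attempts_total  # number of scheduling attempts
scheduler_e2e_scheduling_duration_seconds  # end-to-end scheduling latency
scheduler_pod_scheduling_duration_seconds  # per-Pod scheduling latency
scheduler_binding_duration_seconds  # binding latency
scheduler_queue_incoming_pods_total  # Pods added to the scheduling queue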

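Interview Questions

Basic

1. What are the main responsibilities of the Kubernetes scheduler?

Reference answer:

Select the most suitable Node for each newly created Pod
Coordinate the binding between Pods and Nodes
Balance fairness, efficiency, and resource utilization when scheduling

2. What is the scheduler's workflow?

Reference answer:

Fetch the next Pod from the scheduling queue
Build the scheduling context
Run the Filter stage (exclude infeasible nodes)
Run the Score stage (score the feasible nodes)
Select the best node
Run Bind to bind the Pod to the Node

3. What is the difference between the Filter stage and the Score stage?

Reference answer:

Stage	Purpose	Behavior
Filter	Exclude infeasible nodes	Short-circuits; a single failing plugin excludes the node
Score	Score the feasible nodes	Every feasible node is scored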
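Intermediate

4. Which scheduling queues exist, and what is each one for?

Reference answer:

ActiveQ: active queue, ordered by priority; higher-priority Pods are scheduled first
UnschedulableQ: unschedulable queue, holds Pods that failed scheduling
BackoffQ: backoff queue, retries with exponential backoff (1s, 2s, 4s, ... up to 10s by default)

5. What types of Pod affinity exist, and how do they work?

Reference answer:

requiredDuringSchedulingIgnoredDuringExecution: hard affinity, must be satisfied
preferredDuringSchedulingIgnoredDuringExecution: soft affinity, satisfied when possible
Anti-affinity: requires (hard) or prefers (soft) that the Pod stay out of topology domains that already run matching Pods

6. How does the scheduler handle volumes?

Reference answer:

Static binding: PVs are created manually and matched during scheduling
Dynamic binding: the StorageClass provisions the PV
Delayed binding: volumeBindingMode: WaitForFirstConsumer; the PVC is bound after the Pod has been scheduled

7. How can scheduling behavior be customized?

Reference answer:

NodeSelector/NodeAffinity: node selection
PodAffinity/PodAntiAffinity: Pod affinity
Taints/Tolerations: taints and tolerations
Custom scheduler: set schedulerName
Scheduling framework extensions: write Filter/Score plugins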

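Advanced

8. Which scheduling strategies does the scheduler offer, and what are their trade-offs?

Reference answer:

Strategy	Pros	Cons
BinPack	High resource utilization	Concentrates load; a node failure affects more Pods
Spread	Even distribution, high availability	Lower utilization; free capacity is fragmented, so large Pods may not fit
LeastAllocated	Simple	Based on resource requests, not on actual node load

9. How does the scheduler keep Pod scheduling fair?

Reference answer:

A priority queue ensures that higher-priority Pods are scheduled first
Pods of equal priority are ordered by the time they entered the queue (first in, first out)
Resource quotas, enforced at admission rather than by the scheduler, cap each tenant's resource usage

10. How can scheduler performance be improved?

Reference answer:

Caching: the SchedulerCache reduces calls to the API server
Parallelism: the Filter and Score stages evaluate nodes in parallel
Snapshots: node information is captured once per scheduling cycle
Pod anti-affinity: use it sparingly, because evaluating it is expensive in large clusters

11. What extension points does the scheduling framework provide?

Reference answer:

Stage	Extension point	Purpose
Before scheduling	QueueSort	Custom ordering of the scheduling queue
Before filtering	PreFilter	Pre-processing
Filtering	Filter	Exclude infeasible nodes
After filtering	PostFilter	Runs when no node is feasible (e.g. preemption)
Before scoring	PreScore	Prepare data for scoring
Scoring	Score	Score nodes
Reservation	Reserve	Reserve resources
Permit	Permit	Approve, deny, or delay binding
Before binding	PreBind	Work required before binding (e.g. binding volumes)
Binding	Bind	Perform the binding
After binding	PostBind	Post-processing
On failure	Unreserve	Release reservations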

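Scenario Questions

12. How do you spread Pods across availability zones as evenly as possible?

Reference answer:

apiVersion: v1
kind: Pod
spec:
  affinity:
    podAntiAffinity:
      # "preferred" is best effort; "required" would allow at most one matching Pod per zone
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: topology.kubernetes.io/zone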

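13. How do you schedule GPU workloads?

Reference answer:

Node label: mark the GPU nodes

kubectl label node <node> nvidia.com/gpu=1

Pod configuration (the nvidia.com/gpu resource is advertised by the NVIDIA device plugin):

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    nvidia.com/gpu: "1"
  containers:
  - name: gpu-container
    resources:
      limits:
        nvidia.com/gpu: 1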

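14. How does the scheduler cope with large clusters?

Reference answer:

Use NodeResourcesFit to filter nodes quickly
Enable delayed binding for VolumeBinding
Set percentageOfNodesToScore to reduce the number of nodes that are scored
Spread the scheduling load across multiple schedulers (for example, by assigning workloads to different schedulers through schedulerName)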
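
The effect of percentageOfNodesToScore can be sketched numerically. The function below follows the adaptive rule as understood from the upstream scheduler: clusters with fewer than 100 nodes are always searched in full, an unset percentage defaults to `50 - nodes/125` with a floor of 5%, and the search never stops below 100 feasible nodes. The constant and function names mirror that understanding and should be treated as assumptions.

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind           = 100 // never stop below this many feasible nodes
	minFeasibleNodesPercentageToFind = 5   // floor for the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the Filter stage
// collects before it stops searching. A percentage of 0 selects the adaptive
// default, which shrinks as the cluster grows.
func numFeasibleNodesToFind(numAllNodes, percentage int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentage >= 100 {
		return numAllNodes
	}
	if percentage <= 0 {
		percentage = 50 - numAllNodes/125
		if percentage < minFeasibleNodesPercentageToFind {
			percentage = minFeasibleNodesPercentageToFind
		}
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return n
}

func main() {
	fmt.Println(numFeasibleNodesToFind(50, 0))    // 50: small cluster, search every node
	fmt.Println(numFeasibleNodesToFind(1000, 0))  // 420: adaptive 42%
	fmt.Println(numFeasibleNodesToFind(5000, 0))  // 500: adaptive 10%
	fmt.Println(numFeasibleNodesToFind(1000, 30)) // 300: explicit 30%
}
```

Lowering the percentage shortens each scheduling cycle at the cost of placement quality, because nodes that were never examined cannot be chosen.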