flowchart TD
A[Alert enters Alertmanager] --> B[dispatch: match against the routing tree]
B --> C{"Which Routes match?"}
C -- No match --> Z[Drop or fall back to default handling]
C -- Match --> D["Aggregate into groups (by group_by labels)"]
D --> E[Wait for group_wait]
E --> F[Aggregate alerts within the group]
F --> G{"group_interval elapsed or new alert arrived?"}
G -- No --> F
G -- Yes --> H[Trigger the notification pipeline]
H --> I["Send notification (email / WeChat / DingTalk, etc.)"]
I --> J[Group lifecycle management]
J --> K{"Any active alerts left in the group?"}
K -- Yes --> F
K -- No --> L["Destroy the group and release resources"]
subgraph Legend
direction LR
X1["Route: routing rule that decides grouping, receiver, muting, etc."]
X2["group_by: labels used for grouping"]
X3["group_wait: wait before the first notification"]
X4["group_interval: interval between notifications for a group"]
X5["repeat_interval: interval between repeated notifications"]
X1 --> X2
X1 --> X3
X1 --> X4
X1 --> X5
end
style H fill:#f9f,stroke:#333,stroke-width:2px
style I fill:#bbf,stroke:#333,stroke-width:2px
style D fill:#ffd,stroke:#333,stroke-width:1px
style F fill:#ffd,stroke:#333,stroke-width:1px
style G fill:#ffd,stroke:#333,stroke-width:1px
style J fill:#ffd,stroke:#333,stroke-width:1px
style K fill:#ffd,stroke:#333,stroke-width:1px
style L fill:#eee,stroke:#333,stroke-width:1px
style Z fill:#eee,stroke:#333,stroke-width:1px
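To make the grouping step concrete, here is a minimal Go sketch of how alerts could be bucketed by their group_by labels. The `groupKey` helper is hypothetical and only for illustration; Alertmanager itself derives a fingerprint from the group's label set rather than building a string like this.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey builds a deterministic key from the subset of an alert's labels
// named in groupBy. Alerts that yield the same key belong to the same group.
// Hypothetical helper: not part of the Alertmanager API.
func groupKey(labels map[string]string, groupBy []string) string {
	names := append([]string(nil), groupBy...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		if v, ok := labels[n]; ok {
			parts = append(parts, fmt.Sprintf("%s=%q", n, v))
		}
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	a := map[string]string{"alertname": "HighCPU", "cluster": "prod", "instance": "a:9100"}
	b := map[string]string{"alertname": "HighCPU", "cluster": "prod", "instance": "b:9100"}
	groupBy := []string{"cluster", "alertname"}

	fmt.Println(groupKey(a, groupBy))                         // {alertname="HighCPU",cluster="prod"}
	fmt.Println(groupKey(a, groupBy) == groupKey(b, groupBy)) // true: instance is not a grouping label
}
```

Because `instance` is not listed in `groupBy`, both alerts fall into one group and would be delivered in a single notification.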
The three key parameters
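sequenceDiagram
participant Alert as Incoming alert
participant AM as Alertmanager
participant User as Notification recipient
Alert->>AM: First alert of a new group
Note right of AM: Enter the group_wait period
AM-->>AM: Collect more alerts for the same group
AM->>User: First notification once group_wait expires
Alert->>AM: Another new alert in the group
Note right of AM: If time since last notification < group_interval, wait
AM->>User: Next notification once group_interval expires
loop While the group has unresolved alerts
AM->>User: Repeat notification once repeat_interval expires
end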
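graph LR
A[group_wait] -->|Initial aggregation window| B[First notification]
B -->|New alert| C{Time since last notification}
C -- "< group_interval" --> D[Wait]
C -- "≥ group_interval" --> E[Follow-up notification]
E -->|Still firing| F{Time since last repeat}
F -- "< repeat_interval" --> G[Wait]
F -- "≥ repeat_interval" --> H[Repeat notification]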
How it works
Startup sequence
sequenceDiagram
participant Alert as New alert
participant Dispatcher as Dispatcher
participant Route as Route
participant GroupMap as aggrGroupsPerRoute
participant AggrGroup as aggrGroup
participant Notify as Notification pipeline
Alert->>Dispatcher: New alert arrives
Dispatcher->>Route: Match against the routing tree
Route-->>Dispatcher: List of matching Routes
loop For each matching Route
Dispatcher->>GroupMap: Look up or create group (groupLabels)
alt Group already exists
Dispatcher->>AggrGroup: insert(alert)
else New group
Dispatcher->>GroupMap: Create new aggrGroup
Dispatcher->>AggrGroup: insert(alert)
AggrGroup->>AggrGroup: Start the run goroutine for timed notifications
end
end
Note over AggrGroup: The run goroutine triggers notifications on a timer
AggrGroup->>Notify: flush(alerts...) triggers a notification
Notify-->>AggrGroup: Notification result
AggrGroup->>AggrGroup: Delete resolved alerts
Note over Dispatcher,AggrGroup: Dispatcher periodically runs doMaintenance to clean up empty groups
Dispatcher->>GroupMap: Check whether the group is empty
alt Group is empty
Dispatcher->>AggrGroup: stop()
Dispatcher->>GroupMap: Delete the group
end
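The group bookkeeping in the sequence above can be sketched with a nested map. This is a deliberately simplified, single-goroutine model: the type and method names echo the diagram, but the string keys, the `resolve` method, and the absence of locking and timers are assumptions made for brevity.

```go
package main

import "fmt"

// aggrGroup is a simplified stand-in for an aggregation group: it only
// tracks which alerts (by an illustrative string ID) are still active.
type aggrGroup struct {
	alerts map[string]struct{}
}

// dispatcher maps route key -> group key -> group, loosely mirroring
// the aggrGroupsPerRoute structure named in the diagram.
type dispatcher struct {
	groups map[string]map[string]*aggrGroup
}

func newDispatcher() *dispatcher {
	return &dispatcher{groups: map[string]map[string]*aggrGroup{}}
}

// processAlert looks up or creates the group, then inserts the alert.
// It reports whether a new group had to be created.
func (d *dispatcher) processAlert(route, group, alertID string) bool {
	byGroup, ok := d.groups[route]
	if !ok {
		byGroup = map[string]*aggrGroup{}
		d.groups[route] = byGroup
	}
	ag, exists := byGroup[group]
	if !exists {
		ag = &aggrGroup{alerts: map[string]struct{}{}}
		byGroup[group] = ag
		// A real dispatcher would start the group's run goroutine here.
	}
	ag.alerts[alertID] = struct{}{}
	return !exists
}

// resolve removes an alert from its group (hypothetical helper standing in
// for the deletion of resolved alerts after a flush).
func (d *dispatcher) resolve(route, group, alertID string) {
	if ag, ok := d.groups[route][group]; ok {
		delete(ag.alerts, alertID)
	}
}

// doMaintenance deletes empty groups and returns how many were removed.
func (d *dispatcher) doMaintenance() int {
	removed := 0
	for route, byGroup := range d.groups {
		for key, ag := range byGroup {
			if len(ag.alerts) == 0 {
				delete(byGroup, key)
				removed++
			}
		}
		if len(byGroup) == 0 {
			delete(d.groups, route)
		}
	}
	return removed
}

func main() {
	d := newDispatcher()
	fmt.Println(d.processAlert("root", "cluster=prod", "a1")) // true: new group
	fmt.Println(d.processAlert("root", "cluster=prod", "a2")) // false: existing group
	d.resolve("root", "cluster=prod", "a1")
	fmt.Println(d.doMaintenance()) // 0: a2 is still active
	d.resolve("root", "cluster=prod", "a2")
	fmt.Println(d.doMaintenance()) // 1: the group is now empty
}
```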
var disp *dispatch.Dispatcher
defer func() {
	disp.Stop()
}()

// Start the dispatcher and inhibitor main loops.
disp = dispatch.NewDispatcher(alerts, routes, pipeline, marker, timeoutFunc, nil, logger, dispMetrics)
routes.Walk(func(r *dispatch.Route) {
	if r.RouteOpts.RepeatInterval > *retention {
		configLogger.Warn(
			"repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected.",
			"repeat_interval", r.RouteOpts.RepeatInterval,
			"retention", *retention,
			"route", r.Key(),
		)
	}
})

go disp.Run()

// Run starts the Dispatcher and begins processing incoming alerts.
func (d *Dispatcher) Run() {
	d.done = make(chan struct{})
	// ...
	for {
		select {
		case alert, ok := <-it.Next():
			if !ok {
				// The iterator was closed, possibly because the data source shut down or failed.
				if err := it.Err(); err != nil {
					d.logger.Error("Error on alert update", "err", err)
				}
				return
			}

			d.logger.Debug("Received alert", "alert", alert)

			// Log the error but keep processing.
			if err := it.Err(); err != nil {
				d.logger.Error("Error on alert update", "err", err)
				continue
			}

			now := time.Now()
			for _, r := range d.route.Match(alert.Labels) {
				d.processAlert(alert, r)
			}
			d.metrics.processingDuration.Observe(time.Since(now).Seconds())

		case <-maintenance.C:
			d.doMaintenance()
		case <-d.ctx.Done():
			return
		}
	}
}