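Alertmanager: A Deep Dive into How Alerting Works

Alertmanager architecture

Alert Provider: the API receives alerts from Prometheus and stores them in the Alert Provider.
- Internally it is backed by an in-memory map.
Dispatcher: a dedicated goroutine that subscribes to the Alert Provider for new alerts and, following the routing tree defined in the YAML configuration, routes each alert by its labels into the matching group, so that alerts are processed group by group.
Notification Pipeline: as the name suggests, this component follows the chain-of-responsibility pattern. It improves alert quality by passing alerts through a series of steps such as inhibition, silencing, and deduplication. In the source code it is built by chaining together instances that each implement the Stage interface and each provide one of these functions.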
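Silence Provider: the API layer stores silences in the Silence Provider, and the Notification Pipeline consults it to decide which alerts are currently muted. Silence rules turn off notifications for selected alerts.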
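Notify Provider: sits downstream of the Silence Provider. It records a notification log locally and broadcasts that log to the other nodes in the cluster through its peers. Each node can then tell whether it, or another node, has already sent a given notification, which prevents duplicate notifications across the cluster.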
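The chain-of-responsibility idea behind the Notification Pipeline can be sketched in a few lines. This is only an illustration in Python, not Alertmanager's Go code: the names `silence_stage`, `dedup_stage`, and `run_pipeline` are hypothetical, and alerts are reduced to plain label dictionaries.

```python
from typing import Callable, Dict, List, Set, Tuple

Alert = Dict[str, str]                          # an alert reduced to its label set
Stage = Callable[[List[Alert]], List[Alert]]    # a stage takes alerts and returns the survivors


def silence_stage(silenced_names: Set[str]) -> Stage:
    """Drop alerts whose alertname is silenced (a deliberately simple, hypothetical rule)."""
    def run(alerts: List[Alert]) -> List[Alert]:
        return [a for a in alerts if a.get("alertname") not in silenced_names]
    return run


def dedup_stage(already_sent: Set[Tuple]) -> Stage:
    """Drop alerts that were already notified, keyed by their sorted labels."""
    def run(alerts: List[Alert]) -> List[Alert]:
        fresh = []
        for alert in alerts:
            key = tuple(sorted(alert.items()))
            if key not in already_sent:
                already_sent.add(key)
                fresh.append(alert)
        return fresh
    return run


def run_pipeline(stages: List[Stage], alerts: List[Alert]) -> List[Alert]:
    """Pass the alerts through every stage in order; each stage sees the previous output."""
    for stage in stages:
        alerts = stage(alerts)
    return alerts


pipeline = [silence_stage({"NoisyAlert"}), dedup_stage(set())]
batch = [
    {"alertname": "NoisyAlert", "instance": "a"},
    {"alertname": "DiskFull", "instance": "a"},
    {"alertname": "DiskFull", "instance": "a"},
]
remaining = run_pipeline(pipeline, batch)
print(remaining)
```

Because every stage has the same shape, a new behavior (for example inhibition) can be added by appending one more stage to the list without touching the others.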

amtool

What is amtool

amtool is the official command-line client that ships with Alertmanager.

Installation

Ref: https://github.com/prometheus/alertmanager/tree/main/cmd/amtool

go install github.com/prometheus/alertmanager/cmd/amtool@latest

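Basic usage

# List all currently firing alerts
amtool alert
# List alerts with extended output
amtool -o extended alert
# Query alerts using the matcher syntax
amtool -o extended alert query alertname="Test_Alert"
# Silence an alert
amtool silence add alertname=Test_Alert
# List silences (shows their IDs)
amtool silence query
# Expire a silence by its ID
amtool silence expire b3ede22e-ca14-4aa0-932c-ca2f3445f926
# Query silences with a regular-expression matcher
amtool silence query instance=~".+0"

A silence created with amtool expires automatically after 1h by default. The --expires and --expire-on flags set a longer duration or an explicit window. Flag names differ between amtool versions, so check amtool silence add --help for the ones your version accepts.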

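Understanding the Alertmanager configuration file

Startup flags

--web.external-url	The URL under which Alertmanager is externally reachable
Alerting rule flags (set on the Prometheus server)
--rules.alert.for-outage-tolerance=1h	Maximum Prometheus outage that is tolerated when restoring the "for" state of an alert.
--rules.alert.for-grace-period=10m	Minimum duration between an alert and its restored "for" state. This is maintained only for alerts whose configured "for" time is greater than the grace period.
--rules.alert.resend-delay=1m	Minimum time to wait before resending an alert to Alertmanager.
Alertmanager-related flags (set on the Prometheus server)
--alertmanager.notification-queue-capacity=10000	Capacity of the queue for pending Alertmanager notifications.	Default: 10000
--alertmanager.timeout=10s	Timeout for sending alerts to Alertmanager.	Default: 10s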

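The configuration file is written in YAML. The generic placeholders used in the sections below are defined as follows:

<duration>: a duration matching the regular expression ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0), for example 1d, 1h30m, 5m, 10s
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of Unicode characters
<filepath>: a valid path in the current working directory
<boolean>: a boolean that can take the values true or false
<string>: a regular string
<secret>: a regular string that is a secret, such as a password
<tmpl_string>: a string that is template-expanded before usage
<tmpl_secret>: a string that is template-expanded before usage and that is a secret
<int>: an integer value
<regex>: any valid RE2 regular expression (the regex is anchored on both ends; to un-anchor it, use .*<regex>.*)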
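To make the duration format concrete, here is a minimal sketch that parses such a string with the regular expression quoted above. The function name `parse_duration_ms` is hypothetical, and the unit sizes (a year as 365 days, a week as 7 days) are an assumption based on how Prometheus-style durations are commonly interpreted.

```python
import re

# The duration grammar from the list above, anchored on both ends.
DURATION_RE = re.compile(
    r"^((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?"
    r"(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)$"
)

DAY_MS = 24 * 60 * 60 * 1000

# Capture-group index of each number -> size of its unit in milliseconds.
UNIT_MS = {
    3: 365 * DAY_MS,   # y (assumed to be 365 days)
    5: 7 * DAY_MS,     # w
    7: DAY_MS,         # d
    9: 60 * 60 * 1000, # h
    11: 60 * 1000,     # m
    13: 1000,          # s
    15: 1,             # ms
}


def parse_duration_ms(text: str) -> int:
    """Return the duration in milliseconds, or raise ValueError if it is malformed."""
    match = DURATION_RE.match(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(
        int(match.group(index)) * size
        for index, size in UNIT_MS.items()
        if match.group(index)
    )


print(parse_duration_ms("1h30m"))
```

Units must appear from largest to smallest, so a string such as 30m1h is rejected even though both of its parts are valid on their own.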

The global section

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including the port number.
  # The port is usually 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager does not authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password_file: <string> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_key_file: <filepath> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_key_file: <filepath> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ telegram_api_url: <string> | default = "https://api.telegram.org" ]
  [ webex_api_url: <string> | default = "https://webexapis.com/v1/messages" ]
  # The default HTTP client configuration.
  [ http_config: <http_config> ]

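  # ResolveTimeout is the default value Alertmanager uses when an alert does not include EndsAt.
  # After this time passes, Alertmanager can declare the alert resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.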
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

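# The root route that every incoming alert enters; it defines how alerts are dispatched.
# The root route cannot have any matchers, because it is the entry point for all alerts.
# It must have a receiver configured, so that an alert matching none of the child routes
# is still sent to this default receiver.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...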

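# A list of inhibition rules.
# An inhibition rule mutes a set of alerts while another alert is firing.
# For example, it can mute any warning-level notification if the same alert is already critical.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# DEPRECATED: use time_intervals below.
# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]

# A list of time intervals for muting/activating routes.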
time_intervals:
  [ - <time_interval> ... ]

The route section

[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
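# In short: alerts that share the same values for the group_by labels are batched into
# one group, while group_by: ['...'] disables aggregation and passes every alert through as-is.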
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that an alert has to fulfill to match the node.
matchers:
  [ - <matcher> ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
# If omitted, child routes inherit the group_wait of the parent route.
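# In short: a newly created group waits at least group_wait before its first notification,
# which leaves time to collect several alerts for the same group and send them together.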
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.) If omitted, child routes
# inherit the group_interval of the parent route.
# In short: after the first notification, wait group_interval before notifying about new alerts in the group.
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more). If omitted,
# child routes inherit the repeat_interval of the parent route.
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should be a multiple of `group_interval`.
# In short: once a notification has been sent successfully, wait repeat_interval before sending it again.
[ repeat_interval: <duration> | default = 4h ]

# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section.
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]

# Times when the route should be active. These must match the name of a
# time interval defined in the time_intervals section. An empty value
# means that the route is always active.
# Additionally, the root node cannot have any active times.
# The route will send notifications only when active, but otherwise
# acts normally (including ending the route-matching process
# if the `continue` option is not set).
active_time_intervals:
  [ - <string> ...]

# Zero or more child routes.
routes:
  [ - <route> ... ]
  # Example child routes:
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

The inhibit_rule section

# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
# In short: an alert must satisfy these matchers to be muted by this rule.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
# In short: at least one alert satisfying these matchers must exist for the inhibition to apply.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
# In short: the inhibition applies only when source and target alerts agree on all of these labels.
[ equal: '[' <labelname>, ... ']' ]
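As an illustration of how the source matchers, target matchers, and equal labels work together, here is a minimal sketch. It is not Alertmanager's code: `is_inhibited` is a hypothetical helper, only equality matchers are modeled, and a missing label is treated as an empty value.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]


def _matches(labels: Labels, matchers: Dict[str, str]) -> bool:
    """True if every equality matcher holds; a missing label counts as ''."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(
    target: Labels,
    firing: Iterable[Labels],
    source_match: Dict[str, str],
    target_match: Dict[str, str],
    equal: List[str],
) -> bool:
    """Return True if some firing source alert inhibits the target alert."""
    if not _matches(target, target_match):
        return False  # the rule does not apply to this alert at all
    for source in firing:
        same_labels = all(source.get(n, "") == target.get(n, "") for n in equal)
        if _matches(source, source_match) and same_labels:
            return True
    return False


firing = [{"alertname": "DiskFull", "severity": "critical", "cluster": "a"}]
rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "cluster"],
)
warning_a = {"alertname": "DiskFull", "severity": "warning", "cluster": "a"}
warning_b = {"alertname": "DiskFull", "severity": "warning", "cluster": "b"}
print(is_inhibited(warning_a, firing, **rule), is_inhibited(warning_b, firing, **rule))
```

The warning in cluster a is muted because a critical alert with the same alertname and cluster is firing, while the warning in cluster b still goes out.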

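The http_config section

http_config configures the HTTP client that a receiver uses to talk to HTTP-based API services.

# Note that `basic_auth`, `bearer_token` and `bearer_token_file` are mutually exclusive.
# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]
# Sets the `Authorization` header with the configured bearer token.
[ bearer_token: <secret> ]
# Sets the `Authorization` header with the bearer token read from the configured file.
[ bearer_token_file: <filepath> ]
# Configures the TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]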

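tls_config configures TLS connections.

# Certificate and key files for client certificate authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]
# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]
# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false ]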

The receiver section

# The unique name of the receiver.
name: <string>
# Configurations for the common notification integrations.
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

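Defining alerting rules

Prometheus supports two kinds of rules: recording rules and alerting rules.

Configuring rule file paths

rule_files:
  [ - <filepath_glob> ... ]

Writing alerting rules

Prometheus 1.x used the following syntax for alerting rules:

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

Since Prometheus 2.0, rules are written in YAML and organized into groups: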

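groups:
- name: example
  rules:
  # The name of the alert.
  - alert: HighRequestLatency
    # expr: the PromQL expression that defines the alert condition. It is evaluated
    # to find out whether any time series satisfies the condition.
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    # for: how long the condition must keep holding before the alert fires. An alert
    # that is active but has not fired yet is in the pending state.
    for: 10m
    # labels: additional labels to attach to the alert. Any existing conflicting
    # label is overwritten. Label values can be templated.
    labels:
      severity: page
    # annotations: a second set of labels that is not part of the alert's identity.
    # It usually holds extra information such as a description, can be templated,
    # and is sent to Alertmanager together with the alert.
    annotations:
      summary: High request latency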

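Alerts can use templates, which are a way to use the labels and value of a time series inside annotations and labels. Templates use standard Go template syntax and expose variables that hold the labels and the value of the time series. Labels are available through the $labels variable and the sample value through the $value variable. For example, a summary annotation can reference the instance label and the value of the time series with {{ $labels.instance }} and {{ $value }} respectively.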

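Troubleshooting common alerting problems

Alerts that fail to fire

With for set:

The condition has to hold continuously for the whole for duration before the alert is sent.

With for unset, or set to 0:

The alert skips the pending state, goes straight to firing, and is sent immediately.

# Prometheus
# How often targets are scraped. Default: 1m.
  scrape_interval
# Timeout for a single scrape. Default: 10s.
  scrape_timeout
# How often alerting rules are evaluated. Default: 1m.
  evaluation_interval
# Alertmanager
# How long to wait before sending the first notification for a new group of alerts. Default: 30s.
group_wait
# How long to wait before notifying about new alerts in a group whose first notification was already sent. Default: 5m.
group_interval
# How long to wait before re-sending a notification that was already sent. Default: 4h.
repeat_interval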

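An alert travels through the following steps:

1. Prometheus scrapes its targets every scrape_interval (for example 15s) and evaluates the alerting rule expressions against the collected data every evaluation_interval (for example 10s; the default is 1m).
2. When a target has a problem, Prometheus keeps trying to fetch the data and gives up on that scrape once scrape_timeout has elapsed.
3. If the expression is true, the alert state switches to pending.
4. If the expression stays true for longer than the duration given by for (for example 10 minutes), the alert state changes to firing and Prometheus sends the alert to Alertmanager.
5. In each following evaluation cycle, as long as the expression is still true, Prometheus keeps sending the alert to Alertmanager.
6. As soon as the expression is false in some evaluation cycle, the alert state changes to inactive and Prometheus sends a resolved notice to Alertmanager to signal that the alert is over.
7. When Alertmanager receives alerts, it groups them, waits for the configured group_wait, and only then sends the notification.
8. New alerts may join a group while it is waiting. If the group's earlier notification has already been sent successfully, Alertmanager waits group_interval before sending a notification for the updated group.
9. If the alerts in a group do not change and the notification was sent successfully, Alertmanager waits repeat_interval before sending the same notification again. If the earlier notification was not sent successfully, it is retried after group_interval.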
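The inactive, pending, and firing transitions on the Prometheus side of this flow can be simulated with a small state function. This is a sketch under simplifying assumptions, not Prometheus code: `evaluate` and `simulate` are hypothetical names, time is counted in whole seconds, and one call stands for one evaluation cycle.

```python
from typing import List, Optional, Tuple


def evaluate(
    expr_true: bool, active_since: Optional[int], now: int, for_seconds: int
) -> Tuple[str, Optional[int]]:
    """Return (state, active_since) after one rule evaluation at time `now`."""
    if not expr_true:
        return "inactive", None           # condition cleared: forget the start time
    if active_since is None:
        active_since = now                # condition just became true
    if now - active_since >= for_seconds:
        return "firing", active_since     # held long enough: alert is sent
    return "pending", active_since        # still waiting out the `for` duration


def simulate(samples: List[bool], interval: int, for_seconds: int) -> List[str]:
    """Evaluate the rule once per interval and collect the resulting states."""
    states = []
    active_since = None
    for cycle, expr_true in enumerate(samples):
        state, active_since = evaluate(expr_true, active_since, cycle * interval, for_seconds)
        states.append(state)
    return states


# evaluation_interval = 60s, for = 120s
print(simulate([True, True, True, False], interval=60, for_seconds=120))
# for unset or 0: the alert fires on the first evaluation that is true
print(simulate([True, False], interval=60, for_seconds=0))
```

A single false evaluation resets the timer, which is one reason an alert with a long for duration and a flapping expression may never reach firing.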

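Alert storms

Grouping

Grouping is the mechanism by which Alertmanager puts alerts of the same kind into one group and merges them into a single notification.

By grouping similar or related alerts, Alertmanager avoids flooding you with large numbers of repetitive notifications, so you can focus on the underlying problem.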
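Grouping boils down to building a key from the group_by labels and collecting alerts under it. The sketch below illustrates that idea only; `group_alerts` is a hypothetical helper and alerts are reduced to label dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(alerts: List[Labels], group_by: List[str]) -> Dict[Tuple, List[Labels]]:
    """Bucket alerts by the values of the group_by labels (missing labels count as '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "web-1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "web-2"},
    {"alertname": "LatencyHigh", "cluster": "B", "instance": "web-9"},
]
groups = group_alerts(alerts, group_by=["cluster", "alertname"])
for key, members in groups.items():
    print(key, len(members))
```

Three alerts become two notifications here: one for cluster A covering both instances, and one for cluster B.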

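Inhibition

Inhibition means that once a given alert is firing, Alertmanager stops sending notifications for the other alerts that are a consequence of it.

Silencing

Silences are a simple mechanism for quickly muting alerts based on their labels. Every incoming alert is checked against the active silences, and if it matches the matchers of a silence, no notification is sent for it.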
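The matching check for silences can be sketched as follows. This is an illustration only: `matcher_holds` and `is_silenced` are hypothetical names, and Python's `re` module stands in for RE2, so some regular expressions may behave differently than in Alertmanager. `re.fullmatch` is used because regex matchers are anchored on both ends.

```python
import re
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Matcher = Tuple[str, str, str]   # (label name, operator, value)


def matcher_holds(labels: Labels, matcher: Matcher) -> bool:
    """Check one matcher against an alert; a missing label counts as ''."""
    name, op, value = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def is_silenced(labels: Labels, silences: List[List[Matcher]]) -> bool:
    """An alert is silenced if all matchers of at least one non-empty silence hold."""
    return any(
        bool(silence) and all(matcher_holds(labels, m) for m in silence)
        for silence in silences
    )


silences = [
    [("alertname", "=", "Test_Alert")],
    [("instance", "=~", ".+0")],
]
print(is_silenced({"alertname": "Test_Alert", "instance": "node1"}, silences))
print(is_silenced({"alertname": "Other", "instance": "node01"}, silences))
```

The second alert is not silenced: because the expression is anchored, .+0 only matches instance values that end in 0.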