prometheus监控、告警与存储
# 一、kube-state-metrics
# 1.1 kube-state-metrics介绍
github地址:https://github.com/kubernetes/kube-state-metrics (opens new window)
镜像地址:https://hub.docker.com/r/bitnami/kube-state-metrics (opens new window)
博客介绍:https://xie.infoq.cn/article/9e1fff6306649e65480a96bb1 (opens new window)
kube-state-metrics是通过监听API Server生成有关资源对象的状态指标,比如Deployment、Node、Pod,需要注意的是kube-state-metrics只是简单的提供一个metrics数据,并不会存储这些指标数据,所以我们可以使用Prometheus来抓取这些数据然后存储,主要关注的是业务相关的一些元数据,比如Deployment、Pod、副本状态等,调度了多少个replicas?现在可用的有几个?多少个Pod是running/stopped/terminated状态?Pod重启了多少次?目前由多少job在运行中
# 1.2 部署kube-state-metrics
- 编写基于deploy控制器的yaml文件
- 编写svc的yaml文件,端口暴露为NodePort
- 部署
# 1.3 验证数据
# 1.4 prometheus数据采集
- job_name: 'kube-state-metrics'
static_configs:
- targets: ["IP:PORT"]
2
3
k8s配置文件configmap缩进格式
# 1.5 验证prometheus状态
# 1.6 grafana导入模板
- 13824
- 14518
因为版本不同,可根据对应版本进行设置
Dashboard模板网址:https://grafana.com/grafana/dashboards/ (opens new window)
# 二、监控示例
基于第三方exporter实现对目标服务的监控
# 2.1 tomcat
- 构建镜像
github地址:https://github.com/nlighten/tomcat_exporter (opens new window)
根据tomcat官方镜像添加jar包
ADD metrics.war /data/tomcat/webapps
ADD simpleclient-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_common-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_hotspot-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_servlet-0.8.0.jar /usr/local/tomcat/lib/
ADD tomcat_exporter_client-0.0.12.jar /usr/local/tomcat/lib/
2
3
4
5
6
- prometheus采集
- job_name: 'kube-state-metrics'
static_configs:
- targets: ["IP:PORT"]
2
3
k8s配置文件configmap缩进格式
prometheus验证
grafana导入模板
github地址:https://github.com/nlighten/tomcat_exporter/blob/master/dashboard/example.json (opens new window)
下载这个json文件导入grafana即可
# 2.2 redis
通过redis_exporter监控redis服务装态
github网址:https://github.com/oliver006/redis_exporter (opens new window)
- 部署redis
一个pod两个容器,redis和redis-exporter
- prometheus采集
- job_name: 'redis-metrics'
static_configs:
- targets: ["IP:PORT"]
2
3
k8s配置文件configmap缩进格式
- grafana导入模板
- 14615
- 11692
# 2.3 mysql
通过mysqld_exporter监控MySQL服务的运行状态
github网址:https://github.com/prometheus/mysqld_exporter (opens new window)
- 安装mariadb-server
apt install -y mariadb
- 修改配置文件/etc/mysql/mariadb.conf.d/50-server.cnf监听地址,修改为0.0.0.0
bind-address = 0.0.0.0
重启mariadb
- 创建mysql_exporter用户
create user 'mysql_exporter'@'localhost' identified by 'password';
- 测试用户名密码连接
mysql -umysql_exporter -hlocalhost -ppassword
- 下载mysql_exporter
# 下载
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.13.0/mysqld_exporter-0.13.0.linux-amd64.tar.gz
# 解压
tar xvf mysqld_exporter-0.13.0.linux-amd64.tar.gz
# 查看启动参数
./mysqld_exporter --help
2
3
4
5
6
- 创建免密登陆文件/root/.my.cnf
cat >> /root/.my.cnf <<EOF
[client]
user=mysql_exporter
password=123321
EOF
2
3
4
5
- 创建mysqld_service文件并启动
# 创建软链接
ln -sv /apps/mysqld_exporter-0.13.0.linux-amd64 /apps/mysqld_exporter
# 创建service文件
cat >> /etc/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=Prometheus Mysql Exporter
After=network.target
[Service]
ExecStart=/apps/mysqld_exporter/mysqld_exporter --config.my-cnf=/root/.my.cnf
[Install]
WantedBy=multi-user.target
EOF
# 重新加载配置
systemctl daemon-reload
# 启动mysqld_exporter
systemctl start mysqld_exporter.service
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
- 验证metrics
1647316276383.png
- 验证prometheus
1647316313220.png
- grafana导入模板
- 13106
- 11323
# 2.4 haproxy
通过haproxy_exporter监控haproxy
github网址:https://github.com/prometheus/haproxy_exporter (opens new window)
- 安装haproxy
apt install -y haproxy
- 修改配置文件,监听一个服务
listen SERVICE
bind BIND_IP:PORT
mode tcp
server SERVER_NAME LISTEN_IP:PORT check inter 3s fall 3 rise 3
2
3
4
确保sock文件是admin用户(level后边的admin)
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
1
- 重启haproxy
systemctl restart haproxy
检查监听端口是否正常
- 下载haproxy_exporter
# 下载
wget https://github.com/prometheus/haproxy_exporter/releases/download/v0.13.0/haproxy_exporter-0.13.0.linux-amd64.tar.gz
# 解压
tar xvf haproxy_exporter-0.13.0.linux-amd64.tar.gz
# 创建软链接
ln -sv /apps/haproxy_exporter-0.13.0.linux-amd64 /apps/haproxy_exporter
2
3
4
5
6
- 配置文件启动haproxy_expoter
./haproxy_exporter --haproxy.scrape-uri=unix:/run/haproxy/admin.sock
端口默认监听9101
- 配置haproxy状态页
listen stats
bind :PORT
stats enable
stats uri /haproxy-status
stats realm HAProxy\ Stats\ Page
stats auth haadmin:123456
stats auth admin:123456
2
3
4
5
6
7
编辑/etc/haproxy/haproxy.cfg添加如上配置内容
- 状态页启动haproxy
./haproxy_exporter --haproxy.scrape-uri="http://admin:123456@127.0.0.1:PORT/haproxy-status;csv"
需要指定用户名密码,csv是指定以csv形式展示
- 验证exporter
1647397779547.png
- prometheus数据采集
- job_name: "haproxy-exporter"
static_configs:
- targets: ["127.0.0.1:9101"]
2
3
虚拟机prometheus.yml配置缩进格式
./promtool check config prometheus.yml # 修改后检查配置文件是否正确
- 重启prometheus
systemctl restart prometheus.service
验证prometheus
grafana导入模板
- 367
- 2428
# 2.5 nginx
通过nginx_exporter监控ngix
github模块依赖网址:https://github.com/vozlt/nginx-module-vts (opens new window)
- 安装nginx
# 克隆依赖模块
git clone https://github.com/vozlt/nginx-module-vts.git
# 下载nginx源码
wget http://nginx.org/download/nginx-1.20.2.tar.gz
# 解压
tar xvf nginx-1.20.2.tar.gz
# 安装nginx编译依赖包
apt install -y libgd-dev libgeoip-dev libpcre3 libpcre3-dev libssl-dev gcc make
# 编译nginx
cd nginx-1.20.2
./configure --prefix=/apps/nginx \
--with-http_ssl_module \
--with-http_v2_module \
--with-http_realip_module \
--with-http_stub_status_module \
--with-http_gzip_static_module \
--with-pcre \
--with-file-aio \
--with-stream \
--with-stream_ssl_module \
--with-stream_realip_module \
--add-module=/usr/local/src/nginx-module-vts/
# make
make
# make install
make install
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
- 修改配置文件
http {
vhost_traffic_status_zone;
...
server {
...
location /status {
vhost_traffic_status_display;
vhost_traffic_status_display_format html;
}
}
}
2
3
4
5
6
7
8
9
10
11
12
13
14
15
参考https://github.com/vozlt/nginx-module-vts (opens new window)添加配置
- 启动nginx
# 检查配置文件
/apps/nginx/sbin/nginx -t
# 启动nginx
/apps/nginx/sbin/nginx
2
3
4
- 配置nginx的upstream
# http模块里边,server模块同级
upstream SERVICE {
server IP:PORT;
}
# server模块里边,转发首页
location / {
#root html;
#index index.html index.htm;
proxy_pass http://SERVICE;
}
2
3
4
5
6
7
8
9
10
- 检查状态页
可以以json模式显示数据
- 安装nginx_exporter
# 下载
wget https://github.com/hnlq715/nginx-vts-exporter/releases/download/v0.10.3/nginx-vts-exporter-0.10.3.linux-amd64.tar.gz
# 解压
tar xvf nginx-vts-exporter-0.10.3.linux-amd64.tar.gz
# 创建软链接
ln -sv /apps/nginx-vts-exporter-0.10.3.linux-amd64 nginx-vts-exporter
# 启动nginx_exporter
./nginx-vts-exporter -nginx.scrape_uri http://IP/status/format/json
2
3
4
5
6
7
8
默认监听端口号9913
- 验证数据
1647402416180.png
- prometheus数据采集
- job_name: "nginx-exporter"
static_configs:
- targets: ["127.0.0.1:9913"]
2
3
虚拟机prometheus.yml配置文件缩进格式
prometheus验证数据
grafana导入模板
- 2949
# 2.6 blockbox监控url
官方地址:https://prometheus.io/download/#blackbox_exporter (opens new window)
blockbox_exporter是prometheus官方提供的一个exporter,可以通过http,https,dns,tcp和icmp对被监控节点进行监控和数据采集
http/https:url/api可用性检测
TCP:端口监听检测
ICMP:主机存活检测
DNS:域名解析
- 部署blackbox_exporter
# 下载
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.19.0/blackbox_exporter-0.19.0.linux-amd64.tar.gz
# 解压
tar xvf blackbox_exporter-0.19.0.linux-amd64.tar.gz
# 创建软链接
ln -sv /apps/blackbox_exporter-0.19.0.linux-amd64 blackbox_exporter
2
3
4
5
6
- 创建blackbox-exporter.service文件
cat > /etc/systemd/system/blackbox-exporter.service <<EOF
[Unit]
Description=Prometheus Blackbox Exporter
Documentation=https://prometheus.io/download/#blackbox_exporter
After=network.target
[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/apps/blackbox_exporter/blackbox_exporter --config.file=/apps/blackbox_exporter/blackbox.yml --web.listen-address=:9115
[Install]
WantedBy=multi-user.target
EOF
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
- 启动blackbox_exporter
systemctl daemon-reload
systemctl restart blackbox-exporter
2
- 验证数据
1647413347284.png
默认监听端口9115
- blackbox exporter监控url
prometheus数据采集
- job_name: "http_status"
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets: ["domainname1","domainname2"]
labels:
instance: http_status
group: web
relabel_configs:
- source_labels: [__address__] # relbel通过将__address__(当前目标地址)写入__param_tartget标签来创建一个label
target_label: __param_target # 监控目标domainname,作为__address__的value
- source_labels: [__param_target] # 监控目标
target_label: url # 将监控目标与url创建一个label
- target_label: __address__
replacement: BLACKBOX_EXPORTER:PORT
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
虚拟机prometheus.yml配置文件缩进格式
验证prometheus状态
查看blackbox页面
1647415523065.png
- blockbox_exporter监控icmp
prometheus数据采集
- job_name: "ping_status"
metrics_path: /probe
params:
module: [icmp]
static_configs:
- targets: ["IP1","IP2"]
labels:
instance: 'ping_status'
group: 'icmp'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: ip # 将ip与__param_target创建一个label
- target_label: __address__
replacement: BLACKBOX_EXPORTER:PORT
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
虚拟机prometheus.yml配置文件缩进格式
- 验证prometheus状态
1647417886518.png
- blackbox_exporter监控端口
prometheus数据采集
# 端口监控
- job_name: "port_status"
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets: ["IP:PORT","IP:PORT"]
labels:
instance: 'port_status'
group: 'port'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: ip
- target_label: __address__
replacement: BLACKBOX_EXPORTER:PORT
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
- 验证prometheus状态
1647418487723.png
- grafana导入模板
- 9965
- 13587
# 三、告警
# 3.1 Alertmanager
prometheus-->触发阈值-->超出持续时间-->alertmanager-->分组|抑制|静默-->媒体类型-->邮件|钉钉|微信等
prometheus server通过配置监控规则,实现告警发送,然后把告警push给Alertmanager,匹配Alertmanager配置的Router,以WeChat、Email或Webhook方式发送给对应的Receiver
分组(group):将类似性质的告警合并为单个通知,比如网络通知、主机通知、服务通知
静默(silences):是一种简单的特定时间静音的机制,例如:服务器要升级维护可以先设置这个时间段告警静默
抑制(inhibition):当告警发出后,停止重复发送由此告警引发的其他告警;即合并由一个故障引起的多个告警事件,可以消除冗余告警
- 安装alertmanager
# 下载
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
# 解压
tar xvf alertmanager-0.23.0.linux-amd64.tar.gz
# 创建软链接
ln -sv /apps/alertmanager-0.23.0.linux-amd64 /apps/alertmanager
2
3
4
5
6
- 创建alertmanager.service文件
cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus alertmanager
After=network.target
[Service]
ExecStart=/apps/alertmanager/alertmamager --config.file="/apps/alertmanager/alertmanager.yml"
[Install]
WantedBy=multi-user.target
EOF
2
3
4
5
6
7
8
9
10
11
- 启动alertmanager
systemctl start alertmanager.service
默认监听端口9093,9094
监控配置官方网址:https://prometheus.io/docs/alerting/latest/configuration/ (opens new window)
- 验证alertmanager状态
1647481599806.png
# 3.2 邮件
官方网址:https://prometheus.io/docs/alerting/latest/configuration/#email_config (opens new window)
- 配置文件介绍
alertmanager.yml配置文件
global:
resolve_timeout: 5m # alertmanager在持续多久没有收到新告警后标记为resolved
smtp_from: # 发件人邮箱地址
smtp_smarthost: # 邮箱smtp地址
smtp_auth_username: # 发件人的登陆用户名,默认和发件人地址一致
smtp_auth_password: # 发件人的登陆密码,有时候是授权码
smtp_hello:
smtp_require_tls: # 是否需要tls协议。默认是true
route:
group_by: [alertname] # 通过alertname的值对告警进行分类
group_wait: 10s # 一组告警第一次发送之前等待的时延,即产生告警10s将组内新产生的消息合并发送,通常是0s~几分钟(默认是30s)
group_interval: 2m # 一组已发送过初始告警通知的告警,接收到新告警后,下次发送通知前等待时延,通常是5m或更久(默认是5m)
repeat_interval: 5m # 一组已经发送过通知的告警,重复发送告警的间隔,通常设置为3h或者更久(默认是4h)
receiver: 'default-receiver' # 设置告警接收人
receivers:
- name: 'default-receiver'
email_configs:
- to: 'EMAIL@DOMAIN.com'
send_resolved: true # 发送恢复告警通知
inhibit_rules: # 抑制规则
- source_match: # 源匹配级别,当匹配成功发出通知,其他级别产生告警将被抑制
severity: 'critical' # 告警时间级别(告警级别根据规则自定义)
target_match:
severity: 'warning' # 匹配目标成功后,新产生的目标告警为'warning'将被抑制
equal: ['alertname','dev','instance'] # 基于这些标签抑制匹配告警的级别
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# 时间示例解析 # group_wait: 10s # 第一次产生告警,等待10s,组内有新增告警,一起发出,没有则单独发出 # group_interval: 2m # 第二次产生告警,先等待2m,2m后没有恢复就进入repeat_interval # repeat_interval: 5m # 在第二次告警时延过后,再等待5m,5m后没有恢复,就发送第二次告警
1
2
3
4如上配置,如果告警没有恢复,第二次告警会等待2m+5m,即7分钟后发出
- 配置告警规则
groups:
- name: alertmanager_pod.rules
rules:
- alert: Pod_all_cpu_usage
expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 1
for: 2m
labels:
serverity: critical
service: pods
annotations:
description: 容器 {{ $labels.name }} CPU 资源利用率大于 10% , (current value is {{ $value }})
summary: Dev CPU 负载告警
- alert: Pod_all_memory_usage
expr: sort_desc(avg by(name)(irate(node_memory_MemFree_bytes {name!=""}[5m]))) > 2147483648 # 内存大于2G
for: 2m
labels:
severity: critical
annotations:
description: 容器 {{ $labels.name }} Memory 资源利用大于 2G , (current value is {{ $value }})
summary: Dev Memory 负载告警
- alert: Pod_all_network_receive_usage
expr: sum by(name)(irate(container_network_reveive_bytes_total{container_name="POD"}[1m])) > 52428800 # 大于50M
for: 2m
labels:
severity: critical
annotations:
description: 容器 {{ $labels.name }} network_receive 资源利用大于 50M , (current value is {{ $value }})
- alert: node内存可用大小
expr: node_memory_MemFree_bytes < 4294967296 # 内存小于4G
for: 2m
labels:
severity: critical
annotations:
description: node可用内存小于4G
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
在/apps/prometheus/目录下创建rules目录,创建pods_rule.yaml文件,内容如上
注意缩进格式,如果文件格式有误,重启prometheus的时候,promethues会一直起不来,可以先用promtool检查配置文件格式,因为加载告警配置的时候,引入了这个文件,所以在检查promethues.yml文件的时候也会检查自定义的pods_rule.yaml文件
- promethues加载告警配置
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- IP:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "/apps/prometheus/rules/pods_rule.yaml"
2
3
4
5
6
7
8
9
10
注:如果修改rule_files中的内容,需要先重启prometheus,加载修改后的配置,然后修改alertmanager,不然修改后的告警内容不会生效
- 重启prometheus
systemctl restart prometheus.service
- 验证prometheus状态
在prometheus页面,点击Alerts查看告警状态,当前为PENDING,说明已经检测到告警,还没满足发邮件的时间规则
1647485888388.png
FIRING证明告警已成功,此时应该已经收到邮件
查看alertmanager告警
查看告警邮件
1647485953827.png
点击Source链接,跳转的是主机名加prometheus-server的端口,无法解析就跳转不过去
- 使用amtool查看告警
./amtool alert --alertmanager.url=http://IP:9093
1647485846885.png
# 3.3 钉钉
- 钉钉添加机器人
创建机器人官方网址:https://open.dingtalk.com/document/robots/custom-robot-access (opens new window)
- 发送消息脚本
vim /data/scripts/dingding-keywords.sh
MESSAGE=$1
/usr/bin/curl -X POST 'https://oapi.dingtalk.com/robot/send?access_token=TOKEN'\
-H 'Content-Type: application/json' \
-d '{"msgtype": "text",
"text": {
"content": "${MESSAGE}"
}
}'
2
3
4
5
6
7
8
9
10
- 测试发送消息
1647506928206.png
发送的消息内容中,必须包含自定义的关键字,不然发送消息会失败,发送脚本发送消息成功后,群里会收到
- 部署webhook-dingtalk
# 下载
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
# 解压
tar xvf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
# 创建软链接
ln -sv /apps/prometheus-webhook-dingtalk-1.4.0.linux-amd64 prometheus-webhook-dingtalk
2
3
4
5
6
下载的webhook-dingtalk版本最好跟这个保持一直,新版本有些地方不兼容
- 启动webhook-dingtalk
cd /apps/prometheus-webhook-dingtalk
./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="KEYWORD=https://oapi.dingtalk.com/robot/send?access_token=TOKEN"
2
指定监听端口8060
KEYWORD必须是创建机器人时的自定义关键字,不然告警发布出去,会报错
- 配置alertmanager
- name: 'dingding'
webhook_configs:
- url: 'http://IP:8060/dingtalk/alertsen/send'
send_resolved: true
2
3
4
- 重启alertmanager
systemctl restart alertmanager.service