Monitoring ingress-nginx-controller with Prometheus

Official documentation

Check whether ingress exposes the metrics port

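# Check whether the endpoint returns any metrics
curl http://xx.xx.xx.xx:10254/metrics

# If nothing is returned, add the following to the manifest
vim mandatory.yaml
apiVersion: apps/v1   # Deployments live in apps/v1, not v1
kind: Deployment
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
..
spec:
  ports:
    - name: prometheus
      containerPort: 10254
..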
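
The Deployment change above only opens the port on the controller container. If Prometheus reaches the controller through a Service, the Service needs the same port as well. The fragment below is a minimal sketch of that; the Service name, namespace, and selector label are assumptions and must be adapted to your own manifest.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-metrics          # hypothetical name
  namespace: ingress-nginx             # assumed namespace
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
spec:
  selector:
    app.kubernetes.io/name: ingress-nginx   # assumed pod label; match your Deployment
  ports:
    - name: prometheus
      port: 10254
      targetPort: prometheus           # refers to the named containerPort above
```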

Edit the Prometheus configuration file

vim prometheus.yml
# Add under scrape_configs:
  - job_name: 'ingress-nginx-controller exporter'
    static_configs:
      - targets: ['xx.xx.xx.xx:10254']
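
A static target works, but it does not use the prometheus.io/scrape and prometheus.io/port annotations added earlier. If Prometheus runs inside the cluster with permission to list pods, those annotations can drive discovery instead. The following is a sketch of the commonly used relabeling pattern, not a drop-in config: the job name is hypothetical, and it assumes the annotations are set on the pod template rather than only on the Deployment object.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

With this in place, new controller pods are picked up without editing the target list by hand.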

Configure Grafana

wget https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/grafana/dashboards/nginx.json

# Import the JSON file into Grafana
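
Importing through the Grafana UI is the simplest route. As an alternative, Grafana can load dashboards from disk through a provisioning file. The sketch below assumes nginx.json has been copied into the directory named under path; the file location, provider name, and directory are all assumptions.

```yaml
# /etc/grafana/provisioning/dashboards/ingress-nginx.yaml (assumed location)
apiVersion: 1
providers:
  - name: ingress-nginx                      # hypothetical provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards      # assumed directory holding nginx.json
```

If the dashboard references a data source placeholder, it may still need to be pointed at your Prometheus data source by hand.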

Dashboard in Grafana

Write the alert rules

cd rules && vim ingress_rules.yaml

groups:
  - name: Ingress_monitor
    rules:
      - alert: Too many HTTP 4xx requests (> 5%)
        expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[1m])) / sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
          description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: Too many HTTP 5xx requests (> 5%)
        expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[1m])) / sum(rate(nginx_ingress_controller_requests[1m])) * 100 >= 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
          description: "Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      # ingress-nginx exposes its latency histogram as
      # nginx_ingress_controller_request_duration_seconds_bucket; if the metric
      # below returns no data, switch to that name and to labels it carries.
      - alert: ingress-nginx latency above 3 seconds
        expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Nginx latency high (instance {{ $labels.instance }})
          description: "Nginx p99 latency is higher than 3 seconds\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
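
The rules file only takes effect once Prometheus is told to load it. A minimal sketch of the matching prometheus.yml entry follows; the glob path is an assumption based on the rules directory used above.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "rules/*.yaml"    # assumed path, relative to prometheus.yml
```

Running promtool check rules ingress_rules.yaml before reloading Prometheus catches indentation and expression errors early.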