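Prometheus uniquely identifies a time series by its metric name together with a set of labels. The metric name says what is being measured, while labels add dimensions that describe each collected sample. Users can filter, aggregate, and compute over these dimensions to produce new, derived time series.
PromQL is Prometheus's built-in query language. It supports rich querying, aggregation, and logical operations over time series data.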
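Data types
Every expression or subexpression in PromQL evaluates to one of four types:
- instant vector: a set of time series, each containing a single sample, all sharing the same timestamp
- range vector: a set of time series, each containing a range of data points over time
- scalar: a simple floating-point value
- string: a simple string value, currently unused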
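Instant vectors
An instant vector selector picks a set of time series and a single sample value for each at a given timestamp (the instant). Series can be filtered by matching on label values; two kinds of matching are supported, exact matching and regular expression matching.
Example:
# Select by label
prometheus_http_requests_total{handler="/api/v1/query"}
# Exclude
prometheus_http_requests_total{handler!="/api/v1/query"}
Regular expressions can also be used as match conditions; separate alternatives with |
Example:
# Select the series for several handlers
prometheus_http_requests_total{handler=~"/api/v1/query|/api/v1/series"}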
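To make the matching rules concrete, here is a small Python sketch of how exact and regex matchers filter series by label. It is only an illustration: the SERIES data and the matches/select helpers are made up for this example, and Python's re module stands in for the RE2 engine Prometheus actually uses. The one behaviour worth noticing is that regex matchers are fully anchored, so the pattern has to match the whole label value.

```python
import re

# Hypothetical in-memory series: each one is just a dict of labels.
SERIES = [
    {"__name__": "prometheus_http_requests_total", "handler": "/api/v1/query"},
    {"__name__": "prometheus_http_requests_total", "handler": "/api/v1/series"},
    {"__name__": "prometheus_http_requests_total", "handler": "/metrics"},
]


def matches(labels, name, op, value):
    """Apply one matcher (=, !=, =~, !~) to a label set."""
    actual = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    # Regex matchers are fully anchored, so fullmatch is used rather than search.
    hit = re.fullmatch(value, actual) is not None
    if op == "=~":
        return hit
    if op == "!~":
        return not hit
    raise ValueError(f"unknown matcher operator: {op}")


def select(series, *matchers):
    """Keep the series whose labels satisfy every matcher."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]


exact = select(SERIES, ("handler", "=", "/api/v1/query"))
excluded = select(SERIES, ("handler", "!=", "/api/v1/query"))
either = select(SERIES, ("handler", "=~", "/api/v1/query|/api/v1/series"))
print([s["handler"] for s in either])
```

Because of the anchoring, a matcher such as handler=~"/api" selects nothing here; it would need to be written as handler=~"/api.*".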
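Range vectors
Range vector literals work like instant vector literals. Syntactically, a duration in square brackets [] is appended to the end of the vector selector to specify how far back in time samples are fetched for each resulting range vector element.
The duration is a number followed by one of these units:
- s seconds
- m minutes
- h hours
- d days
- w weeks
- y years
Example
prometheus_http_requests_total{handler="/api/v1/query"}[5m]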
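Time offset
The offset modifier shifts the query time into the past. Examples:
# Return the samples from 5 minutes ago
prometheus_http_requests_total{} offset 5m
# Return yesterday's samples (a one-day range ending one day ago)
prometheus_http_requests_total{}[1d] offset 1d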
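Scalar float values
Scalar float values can be written literally in the form [-](digits)[.(digits)]. A scalar is a single number with no time series attached.
Strings
Strings are written directly as literals.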
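Operators
Arithmetic operators
+ addition
- subtraction
* multiplication
/ division
% modulo
^ exponentiation
# Host memory size; the metric is in bytes, so convert to GiB
node_memory_MemTotal_bytes / 1024 / 1024 / 1024
# Combined read and write I/O bytes for each disk
node_disk_written_bytes_total + node_disk_read_bytes_total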
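Comparison (boolean) operators
These filter time series by their sample values and are often used in alerting rules.
== equal
!= not equal
> greater than
< less than
>= greater than or equal
<= less than or equal
# Memory usage ratio for all host nodes
(node_memory_bytes_total - node_memory_free_bytes_total) / node_memory_bytes_total
# Alerting rule expression: hosts whose memory usage exceeds 95%
(node_memory_bytes_total - node_memory_free_bytes_total) / node_memory_bytes_total > 0.95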
Set operations
Supported set operators:
and (intersection)
or (union)
unless (complement; excludes matching series)
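The three set operators are easier to picture with a small model. The Python sketch below treats an instant vector as a list of (labels, value) pairs and matches elements on their full label set, ignoring the metric name. The helper names and the sample data are invented for illustration, and on/ignoring modifiers are left out.

```python
def key(labels):
    """Matching key: every label except the metric name."""
    return frozenset((k, v) for k, v in labels.items() if k != "__name__")


def vec_and(left, right):
    """Elements of left that have a matching label set in right."""
    right_keys = {key(labels) for labels, _ in right}
    return [(labels, v) for labels, v in left if key(labels) in right_keys]


def vec_unless(left, right):
    """Elements of left that have no matching label set in right."""
    right_keys = {key(labels) for labels, _ in right}
    return [(labels, v) for labels, v in left if key(labels) not in right_keys]


def vec_or(left, right):
    """All of left, plus the elements of right whose label set is not in left."""
    left_keys = {key(labels) for labels, _ in left}
    return left + [(labels, v) for labels, v in right if key(labels) not in left_keys]


# Hypothetical vectors keyed by instance.
load = [
    ({"__name__": "node_load1", "instance": "a"}, 0.5),
    ({"__name__": "node_load1", "instance": "b"}, 3.2),
]
up = [
    ({"__name__": "up", "instance": "b"}, 1.0),
    ({"__name__": "up", "instance": "c"}, 1.0),
]

both = vec_and(load, up)        # only instance b
only_load = vec_unless(load, up)  # only instance a
union = vec_or(load, up)        # a and b from load, c from up
print([labels["instance"] for labels, _ in union])
```

Note that the values always come from the side that contributed the element: and and unless keep left-hand values, and or takes right-hand values only for label sets missing on the left.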
Operator precedence
From highest to lowest:
^
*, /, %
+, -
==, !=, <=, <, >=, >
and, unless
or
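Aggregation operators
The built-in aggregation operators aggregate the samples returned by an instant vector expression into a new time series with fewer elements.
sum (sum)
min (minimum)
max (maximum)
avg (average)
stddev (standard deviation)
stdvar (standard variance)
count (number of elements)
count_values (number of elements with the same value)
bottomk (the k smallest elements by sample value)
topk (the k largest elements by sample value)
quantile (the φ-quantile, 0 ≤ φ ≤ 1)
Aggregation syntax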
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
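Only count_values, quantile, topk, and bottomk take a parameter.
without removes the listed labels from the result and keeps all others. by does the opposite: only the listed labels are kept and the rest are dropped. Together, without and by let you aggregate samples along the label dimensions you choose.
Examples
# Sum
sum(http_requests_total)
# Average
avg(http_requests_total)
# The 3 largest values
topk(3, http_requests_total)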
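The difference between by and without is mostly about which labels survive into the grouping key. Here is a rough Python sketch of that grouping step; the aggregate, avg, and topk helpers and the SAMPLES data are assumptions made up for this illustration, not anything Prometheus exposes.

```python
from collections import defaultdict

# Hypothetical instant vector: (labels, value) pairs.
SAMPLES = [
    ({"instance": "a", "cpu": "0", "mode": "idle"}, 10.0),
    ({"instance": "a", "cpu": "1", "mode": "idle"}, 30.0),
    ({"instance": "b", "cpu": "0", "mode": "idle"}, 20.0),
]


def aggregate(samples, op, by=None, without=None):
    """Group samples by the labels that survive, then reduce each group with op."""
    groups = defaultdict(list)
    for labels, value in samples:
        if without is not None:
            # without: drop the listed labels, keep everything else
            kept = {k: v for k, v in labels.items() if k not in without}
        else:
            # by: keep only the listed labels (none listed means one big group)
            kept = {k: v for k, v in labels.items() if k in (by or ())}
        groups[tuple(sorted(kept.items()))].append(value)
    return {group: op(values) for group, values in groups.items()}


def avg(values):
    return sum(values) / len(values)


def topk(samples, k):
    """The k samples with the largest values, labels untouched."""
    return sorted(samples, key=lambda s: s[1], reverse=True)[:k]


total = aggregate(SAMPLES, sum)                          # like sum(x)
per_instance = aggregate(SAMPLES, sum, by=("instance",))  # like sum by (instance) (x)
per_mode = aggregate(SAMPLES, avg, without=("cpu",))      # like avg without (cpu) (x)
print(per_instance)
```

Unlike the other aggregators, topk returns the original elements with all their labels, which is why it is modelled as a separate function here.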
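Common functions
- increase()
Used with counter metrics. It takes the first and last samples in the range vector and returns the growth between them. Dividing by the length of the range in seconds gives the average growth rate over that period.
# Average CPU usage on the host over the last two minutes
increase(node_cpu_seconds_total[2m]) / 120
- rate()
Returns the average per-second increase of a counter over the time range
rate(node_cpu_seconds_total[2m])
- sum()
In practice most CPUs are multi-core, and the node CPU metric reports each core separately. We usually care about the CPU as a whole rather than individual cores. A bare sum() adds up the data from every machine, so also use by (instance) or by (cluster_name) to get the CPU data for a single server or a group of servers.
# Compute the increase for each series first, then sum them
sum(increase(node_cpu_seconds_total[1m]))
- topk()
Picks the top N values out of a large data set, with N chosen by the user. For example, when monitoring 320 CPUs across 100 servers, this function shows the few with the highest current load, which is useful for alerting.
topk(3,node_disk_io_now)
- predict_linear()
Computes the rate of change of a curve in order to make a prediction. For example, if free disk space has been falling over the past hour, the disk may fill up soon; this function uses the last hour of data to predict the state a few hours ahead so that an alert can fire early.
# Alert if free disk space is predicted to fall below zero 4 hours from now
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h],4*3600)<0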
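The arithmetic behind increase(), rate(), and predict_linear() can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: the real increase() and rate() also extrapolate to the edges of the range window, whereas this version only looks at the samples it is given and divides by the time between the first and last sample. The sample lists are invented.

```python
def increase(samples):
    """Growth of a counter across samples, given as (timestamp_seconds, value).

    A drop in value is treated as a counter reset, so counting restarts from 0.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span


def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line and evaluate it seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical counter scraped every 60 seconds.
counter = [(0, 100.0), (60, 160.0), (120, 220.0)]
# Hypothetical free-space gauge sampled over one hour.
free_bytes = [(0, 1000.0), (1800, 800.0), (3600, 600.0)]

print(increase(counter), rate(counter))
print(predict_linear(free_bytes, 4 * 3600) < 0)
```

With the gauge above losing 200 bytes every half hour, the fitted line crosses zero well before the four-hour mark, so the comparison is true and an alert built on it would fire.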
Common expressions
CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
Memory usage
100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
Disk usage
100 - (node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs|rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs|rootfs"} * 100)
CPU iowait percentage per host
avg(irate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100
1-minute system load
sum by (instance) (node_load1)
Network interface traffic (received bytes)
avg(irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) by (environment,instance,device)
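Alerting rules
Prometheus collects and stores the data and evaluates the alerting rules; dashboards are built with Grafana, and alert delivery is handled by Alertmanager.
Alerting process
Alerting rules are defined in the Prometheus server, which generates the alerts; the Alertmanager component then processes them.
How Prometheus triggers an alert:
prometheus -> alert threshold reached -> duration exceeded -> alertmanager -> grouping|inhibition -> notification channel -> email|DingTalk|WeChat, etc.
This article does not cover Alertmanager configuration, only the Prometheus rule configuration.
Defining alerting rules
A typical alerting rule looks like this: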
groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info
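In a rule file, a set of related rules is defined under one group, and each group can contain multiple alerting rules. An alerting rule consists mainly of the following parts:
- alert: the name of the alerting rule.
- expr: the trigger condition, written as a PromQL expression, used to check whether any time series satisfies the condition.
- for: the evaluation wait time, optional. The alert is sent only after the trigger condition has held for this long. While waiting, a newly raised alert is in the pending state.
- labels: custom labels, a set of additional labels to attach to the alert.
- annotations: a set of additional information, such as text describing the alert in detail. Annotations are sent to Alertmanager along with the alert when it fires.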
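For Prometheus to load the alerting rules, the rule_files setting in the main Prometheus configuration file must list the paths of the rule files. After startup, Prometheus scans the rule files under those paths and evaluates the rules to decide whether to send notifications:
rule_files:
  [ - <filepath_glob> ... ]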
By default Prometheus evaluates these alerting rules once a minute. To use a different evaluation period, override the default with evaluation_interval:
global:
  [ evaluation_interval: <duration> | default = 1m ]
The $labels.<labelname> variable gives access to the value of the named label on the current alert instance, and $value gives the sample value computed by the current PromQL expression.
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
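Configuring Prometheus rules
Defining the alert files
Edit the Prometheus configuration file as follows
# Prerequisite: Prometheus is already installed
# Create the rule directory
mkdir /usr/local/prometheus/rule
# Edit the Prometheus configuration file
vim /usr/local/prometheus/prometheus.yml
The configuration should make Prometheus read the YAML files under the rule directory as alerting rules.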
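Alerting rule example 1
A configuration example, starting with the simple case of a host going down
#vim node_monitor.yaml
groups:
- name: node-up # alert group
  rules:
  - alert: node-up # alert name
    expr: up{job="localhost-node"} == 0 # PromQL expression; you can define your own condition
    for: 15s # how long the expression must hold; 0 fires as soon as the condition is met
    labels:
      severity: critical # a label on the alert; alert routing is based on labels
    #annotations: # annotation text for the email; variables can be referenced
    #  summary: "{{$labels.instance}} has been down for more than 15s!"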
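After writing a rule, use promtool to check that the rules file is valid; an invalid file can prevent the Prometheus service from starting
./promtool check config prometheus.yml
./promtool check rules rule/node_monitor.yaml
Notes:
1 Prometheus supports variables for reading label values. For example, the $labels.<labelname> variable gives the value of the named label on the current alert instance, and $value gives the sample value computed by the current PromQL expression.
2 When creating rule files, use a separate file for each kind of target, such as web.yaml, mysql.yaml, and so on
3 expr is the alert expression, written as a PromQL query
4 Save and exit, restart the Prometheus service, and check that the rule has taken effect
5 Test that the alerting rule works by stopping node_exporter.
systemctl stop node_exporter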
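Alert states
- inactive: the rule is being evaluated, but no alert has been triggered yet
- pending: the alert condition has been met but has not yet held for the duration set by for. Once it has held that long, the alert moves to firing
- firing: the alert is sent to Alertmanager, which delivers it to all receivers according to its configuration (after any grouping, inhibition, or silencing). Once the condition clears, the alert returns to inactive, and the cycle repeats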
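The three states form a small state machine driven by the rule's for duration. The Python sketch below walks through it one evaluation at a time. The alert_states function and the 5-second evaluation interval are assumptions chosen for the illustration; Prometheus tracks this per alert instance internally.

```python
def alert_states(condition_by_tick, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_tick: one boolean per evaluation, True when expr returned a result.
    for_seconds: the rule's `for` duration.
    interval_seconds: time between evaluations (hypothetical value in this sketch).
    """
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, holds in enumerate(condition_by_tick):
        now = i * interval_seconds
        if not holds:
            # Condition cleared: forget the pending/firing alert.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# A target goes down at the second evaluation and recovers at the sixth.
down = [False, True, True, True, True, False]
timeline = alert_states(down, for_seconds=15, interval_seconds=5)
print(timeline)
```

With for set to 0 the same input skips pending entirely and goes straight to firing on the first failing evaluation, which matches the comment on the for field in example 1.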
Alerting rule example 2
Create hoststats-alert.yaml
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    # annotations:
    #   summary: "Instance {{ $labels.instance }} CPU usage high"
    #   description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    # annotations:
    #   summary: "Instance {{ $labels.instance }} MEM usage high"
    #   description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
Restart Prometheus and check the currently loaded rule files