eBPF Research Notes (2) - 天翼云开发者社区


Continuing from the previous installment, this post fills in two areas:

Using bpftrace, and combining it with Kubernetes

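Part (1) introduced bpftrace briefly; this part goes a little deeper. Like DTrace and SystemTap, bpftrace provides a simplified tracing language built on top of eBPF. The language is modeled on awk, C, and DTrace's D language (it is not D itself), and a few lines of script are enough for fairly sophisticated tracing. Multi-line programs can be saved in a script file, conventionally with the .bt extension. In production, running bpftrace inside a Docker container is recommended.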

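In Kubernetes, the kubectl-trace plugin makes bpftrace more convenient to use. It is an open-source IOVisor project, and installation is simple: it is installed directly on the machine where kubectl runs. For details, see: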

http://github.com/iovisor/kubectl-trace

# kubectl trace is a kubectl plugin that allows you to schedule the execution of bpftrace programs in your Kubernetes cluster.

It can trace a whole node as well as any individual container. For example:

# kubectl trace run $nodename -e "tracepoint:syscalls:sys_enter_* { @[probe] = count(); }"

# kubectl trace run $nodename -f read.bt
(with -f, the bpftrace program is read from a file)

# kubectl trace run -e 'uretprobe:/proc/$container_pid/exe:"main.counterValue" { printf("%d\n", retval) }' pod/$pod_name

As these examples show, kubectl trace is used in much the same way as bpftrace; it is essentially a wrapper around it.
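The invocation shapes above (a target plus either an inline program with -e or a script file with -f) can be sketched as a small helper that only assembles the argument list and never runs anything. The function name `kubectl_trace_argv` and the node name `node-1` are hypothetical; only `run`, `-e`, and `-f` come from the examples above.

```python
import shlex
from typing import List, Optional


def kubectl_trace_argv(target: str,
                       program: Optional[str] = None,
                       program_file: Optional[str] = None) -> List[str]:
    """Build (but do not execute) a `kubectl trace run` argument list.

    Exactly one of `program` (inline bpftrace text, passed with -e) or
    `program_file` (path to a .bt file, passed with -f) must be given.
    """
    if (program is None) == (program_file is None):
        raise ValueError("pass exactly one of program or program_file")
    argv = ["kubectl", "trace", "run", target]
    if program is not None:
        argv += ["-e", program]
    else:
        argv += ["-f", program_file]
    return argv


if __name__ == "__main__":
    # "node-1" is a placeholder node name, not taken from the article.
    inline = kubectl_trace_argv(
        "node-1",
        program="tracepoint:syscalls:sys_enter_* { @[probe] = count(); }",
    )
    from_file = kubectl_trace_argv("node-1", program_file="read.bt")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(inline))
    print(shlex.join(from_file))
```

Keeping the command as a list rather than a single string avoids quoting mistakes around the braces and quotes that bpftrace programs contain.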

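Note: tracing the kernel with kubectl trace requires elevated privileges, which could be granted through a PodSecurityPolicy (PSP), for example. However, PodSecurityPolicy was deprecated in Kubernetes v1.21 and removed in v1.25.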

eBPF superpowers for edge networking (ByteDance best practices)

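Volcengine's edge computing platform (火山引擎边缘计算) also makes heavy use of eBPF and its map mechanism in the data plane, and has built a series of high-performance, highly available cloud-native networking solutions on eBPF, including VPC networking, load balancing, elastic public IPs, and an external firewall. See https://www.volcengine.com/docs/6499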

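The Huya (虎牙) infrastructure team has done a lot of work in edge computing, pushing real-time content processing down to the edge and developing an edge container solution and an edge container network. To deal with jitter on the public network at the edge, the team built its own Huya SD-WAN solution, named "蜘蛛侠" (Spider-Man), and a high-performance edge gateway based on eBPF and DPDK that suits the high-bandwidth, low-latency needs of video workloads.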

Introduction to the built-in bcc-tools

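BCC ships with a powerful tool set called bcc-tools: a collection of eBPF-based dynamic tracing tools for Linux performance monitoring, networking, and more. Their usage is outlined briefly below.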

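For example, they can:

Trace disk I/O and its latency
Trace UNIX and TCP sockets
Trace exec() system calls, including short-lived ones (useful for debugging and troubleshooting)
Trace the on-CPU/off-CPU distribution
Trace OOM events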

bcc-tools commands

https://github.com/iovisor/bcc

# cd /usr/share/bcc/introspection

# ./bps --help

BPF Program Snapshot (bps):

List of all BPF programs loaded into the system.

Usage: bps [bpf-prog-id]

[bpf-prog-id] If specified, it shows the details info of the bpf-prog


bcc-tools: profile

# cd /usr/share/bcc/tools

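./profile -h options	Detailed description
Profile CPU stack traces at a timed interval	Samples CPU stack traces at a fixed interval to build a CPU profile
-p PID, --pid PID: profile this PID only	Only profile the call stacks of this PID
-U, --user-stacks-only: show stacks from user space only (no kernel space stacks)	Show user-space stacks only, without kernel stacks
-K, --kernel-stacks-only: show stacks from kernel space only (no user space stacks)	Show kernel stacks only, without user-space stacks
-F FREQUENCY, --frequency FREQUENCY: sample frequency, Hertz	For example, -F 99 samples at 99 Hz; the default is 49 Hz
-c COUNT, --count COUNT: sample period, number of events	Take one sample every COUNT events instead of sampling by time (for example, -c 5 samples once per 5 events); -c and -F cannot be used together
-C CPU, --cpu CPU: cpu number to run profile on	Profile only the CPU with this number
--stack-storage-size STACK_STORAGE_SIZE: the number of unique stack traces that can be stored and displayed (default 16384)	Sets how many distinct stack traces can be stored and displayed (default 16384). A "unique stack trace" is one distinct call chain: samples that hit the same chain share a single entry and only increase its count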
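To make "unique stack traces" concrete: a sampling profiler records one call chain per sample, and identical chains are stored once with a counter. The sketch below imitates that aggregation in plain Python and renders it in the folded "frame;frame;frame count" style that flame graph tooling consumes. It is not the tool's implementation; the sample data, the function names, and the `max_unique` parameter (a stand-in for the role of --stack-storage-size) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate_stacks(samples: Iterable[Tuple[str, ...]],
                     max_unique: int = 16384) -> Tuple[Counter, int]:
    """Count identical stacks, refusing new ones once max_unique are stored.

    Returns (counts per unique stack, number of samples that were dropped).
    """
    counts: Counter = Counter()
    dropped = 0
    for stack in samples:
        if stack not in counts and len(counts) >= max_unique:
            dropped += 1  # no room left for another unique stack
            continue
        counts[stack] += 1
    return counts, dropped


def folded(counts: Counter) -> List[str]:
    """Render stacks root-first as 'a;b;c N', most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{';'.join(stack)} {n}" for stack, n in ordered]


if __name__ == "__main__":
    # Made-up samples: each tuple is one sampled call chain, root first.
    samples = [
        ("main", "parse"),
        ("main", "render", "draw"),
        ("main", "parse"),
    ]
    counts, dropped = aggregate_stacks(samples)
    print("\n".join(folded(counts)))
    print("dropped:", dropped)
```

The point of the cap is that storage is sized by the number of distinct call chains, not by the number of samples: a long profile of a tight loop needs very few entries.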

https://www.cnblogs.com/haoxing990/p/12122372.html

./funclatency -h
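Time functions and print latency as a histogram	As the name suggests, funclatency measures how long functions take to execute (the time spent running the function, not how long it was delayed) and shows the result as a histogram
-p PID, --pid PID: trace this PID only	Only time functions called by this process ID
-i INTERVAL, --interval INTERVAL: summary interval, in seconds	Interval between summaries, in seconds
-d DURATION, --duration DURATION: total duration of trace, in seconds	Total tracing time, in seconds. -i and -d combine: -i 2 -d 10 prints a summary every 2 s for 10 s
-T, --timestamp: include timestamp on output	Print a timestamp with each output, i.e. the time at which funclatency produced it
-u, --microseconds: microsecond histogram	Build the histogram in microseconds
-m, --milliseconds: millisecond histogram	Build the histogram in milliseconds
-r, --regexp: use regular expressions. Default is "*" wildcards only.	Treat the pattern as a regular expression to match a class of functions; without -r, "*" is the only wildcard
./funclatency 'c:printf'  # time the printf family of functions	The c: prefix names the library that contains the function; c stands for libc
./funclatency c:read	Shows the latency distribution of read() in the user-space shared C library (libc)
./funclatency 'clone'	In the author's run this printed: 0 functions matched by "clone". Exiting.
./funclatency 'vfs_fstat*'	Meant to show the latency distribution of the vfs_fstat family of functions; in the author's run this printed: 0 functions matched by "vfs_fstat". Exiting.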
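The histograms these tools print group values into power-of-two ranges (0 -> 1, 2 -> 3, 4 -> 7, and so on). The sketch below shows one way to compute and render such buckets in plain Python. It is an illustration of the bucketing idea under that assumption, not the code the tool runs, and the function names and sample latencies are hypothetical.

```python
from typing import Dict, Iterable, List


def log2_slot(value: int) -> int:
    """Slot 1 holds 0..1, slot 2 holds 2..3, slot 3 holds 4..7, and so on."""
    if value < 0:
        raise ValueError("latency cannot be negative")
    return max(1, value.bit_length())


def log2_histogram(values: Iterable[int]) -> Dict[int, int]:
    """Map each slot number to how many values fell into it."""
    hist: Dict[int, int] = {}
    for v in values:
        slot = log2_slot(v)
        hist[slot] = hist.get(slot, 0) + 1
    return hist


def render(hist: Dict[int, int], unit: str = "usecs", width: int = 20) -> List[str]:
    """Render one text line per slot, from slot 1 up to the highest used slot."""
    lines = [f"{unit:>20} : count    distribution"]
    if not hist:
        return lines
    peak = max(hist.values())
    for slot in range(1, max(hist) + 1):
        low = 0 if slot == 1 else 1 << (slot - 1)
        high = (1 << slot) - 1
        n = hist.get(slot, 0)
        bar = "*" * (n * width // peak)  # bar length relative to the fullest slot
        lines.append(f"{low:>9} -> {high:<9}: {n:<8} |{bar:<{width}}|")
    return lines


if __name__ == "__main__":
    # Made-up latencies in microseconds.
    print("\n".join(render(log2_histogram([0, 1, 2, 3, 5, 9, 1000]))))
```

Power-of-two buckets keep the table short even when latencies span several orders of magnitude, at the cost of coarse resolution for large values.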

https://www.cnblogs.com/haoxing990/p/12129914.html

./hardirqs -h
Summarize hard irq event time as histograms	Summarizes the time spent in hard interrupts (hard IRQs), optionally as histograms
positional arguments: interval (output interval, in seconds), outputs (number of outputs)	Output interval and number of outputs. ./hardirqs 1 10 prints 1-second summaries, 10 times
-T, --timestamp: include timestamp on output	Show a timestamp
-N, --nanoseconds: output in nanoseconds	Show times in nanoseconds
-C, --count: show event counts instead of timing	Show how many times each interrupt fired instead of how long it ran
-d, --dist: show distributions as histograms	Show a separate histogram for each hard IRQ
./hardirqs -NT 1	Print every 1 s, in nanoseconds, with timestamps

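Judging from the help output, hardirqs is rather limited: it cannot even trace one specific hard IRQ on its own, which may seem frustrating. Why is it designed this way? The author sees two main reasons:

To trace a single function, funclatency is sufficient.
hardirqs is lighter and more efficient than tracing interrupts with perf.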

https://www.cnblogs.com/haoxing990/p/12130344.html

./softirqs -h
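Summarize soft irq event time as histograms.	As the name suggests, softirqs traces soft interrupt (softirq) events, mainly to measure softirq handling latency. It is used the same way as hardirqs, so the details are omitted here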

https://www.cnblogs.com/haoxing990/p/12153979.html

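Compared with funclatency, funcslower is more specific about time. funclatency is a good choice when you want the distribution of a function's execution time; funcslower is more convenient when you urgently need to see the individual calls of a function that ran for too long.

As the name suggests (the slow ones among functions), funcslower shows which function calls took longer than a configured threshold. It may serve as a last-resort diagnostic for performance problems after other system analysis tools have come up empty.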
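The idea described above (remember when a call started, measure the elapsed time when it returns, and report only calls over a threshold) can be sketched over a list of pre-recorded events. This is a simplified model, not the tool's implementation: the `Event` type, the field names, and the sample timestamps are assumptions, and recursive calls are not handled.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    ts_us: int   # timestamp in microseconds
    tid: int     # thread id
    func: str    # function name
    kind: str    # "entry" or "return"


def slow_calls(events: Iterable[Event], min_us: int) -> List[Tuple[int, str, int]]:
    """Return (tid, func, duration_us) for calls lasting at least min_us."""
    started: Dict[Tuple[int, str], int] = {}
    slow: List[Tuple[int, str, int]] = []
    for ev in events:
        key = (ev.tid, ev.func)
        if ev.kind == "entry":
            started[key] = ev.ts_us  # remember when this thread entered func
        elif ev.kind == "return" and key in started:
            duration = ev.ts_us - started.pop(key)
            if duration >= min_us:
                slow.append((ev.tid, ev.func, duration))
    return slow


if __name__ == "__main__":
    # Made-up trace: thread 1 returns quickly, thread 2 takes 2000 us.
    trace = [
        Event(0, 1, "do_work", "entry"),
        Event(5, 1, "do_work", "return"),
        Event(10, 2, "do_work", "entry"),
        Event(2010, 2, "do_work", "return"),
    ]
    for tid, func, duration in slow_calls(trace, min_us=1000):
        print(f"tid={tid} {func} took {duration} us")
```

Unlike a histogram, this keeps one record per slow call, which is what makes it possible to attach the thread, the arguments, or a stack to each outlier.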

./funcslower -h
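Trace slow kernel or user function calls.	Traces slow function calls in the kernel or in user space
-m MIN_MS, --min-ms MIN_MS: minimum duration to trace (ms)	Threshold in ms: only calls that run at least this long are reported
-u MIN_US, --min-us MIN_US: minimum duration to trace (us)	Threshold in us: only calls that run at least this long are reported
-U, --user-stack: show stacks from user space	Show the user-space call stack
-K, --kernel-stack: show stacks from kernel space	Show the kernel call stack
-f: print output in folded stack format.	In the author's view, this is probably useful for generating flame graphs
-a ARGUMENTS, --arguments ARGUMENTS: print this many entry arguments, as hex	Print this many of the function's entry arguments, in hexadecimal
./funcslower __kmalloc -a 2 -u 1	Show __kmalloc calls that take longer than 1 us, together with the first two arguments each process passed to the function
./funcslower read -a 3 -u 1	Fails with: cannot attach kprobe, probe entry may not exist. Exception: Failed to attach BPF program trace_0 to kprobe read. A bare name is treated as a kernel function, and there is apparently no traceable kernel function named read
./funcslower c:read -a 3 -u 1	Works: the c: prefix selects read() in libc, so a user-space probe is attached instead of a kprobe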
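The difference between `read` and `c:read` comes down to how the target is named: a bare name is looked up as a kernel function, while "library:function" names a function in a user-space library. A minimal sketch of that split is shown here. It covers only these two forms, it is not the parsing code the tools use, and `ProbeTarget`, `parse_target`, and the example library path are hypothetical.

```python
from typing import NamedTuple, Optional


class ProbeTarget(NamedTuple):
    library: Optional[str]  # None means "kernel function"
    function: str

    @property
    def probe_kind(self) -> str:
        """Kernel functions are traced with kprobes, library functions with uprobes."""
        return "kprobe" if self.library is None else "uprobe"


def parse_target(spec: str) -> ProbeTarget:
    """Split 'function' or 'library:function' into its parts."""
    library, sep, function = spec.rpartition(":")
    if not function:
        raise ValueError(f"no function name in {spec!r}")
    if sep and not library:
        raise ValueError(f"empty library name in {spec!r}")
    return ProbeTarget(library if sep else None, function)


if __name__ == "__main__":
    # The last entry uses a placeholder path to show a full library path.
    for spec in ("__kmalloc", "c:read", "/usr/lib/libc.so.6:read"):
        target = parse_target(spec)
        print(f"{spec!r}: {target.probe_kind} on {target.function}")
```

Seen this way, the failure with a bare `read` is a lookup problem rather than a permissions problem: the name is searched for in the wrong place.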

https://www.cnblogs.com/haoxing990/p/12159247.html

./funccount -h
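Count functions, tracepoints, and USDT probes (USDT: user-mode statically defined traces)	As the name suggests, funccount counts how many times functions are called
-p PID, --pid PID: trace this PID only	Only count calls made by this process
-i INTERVAL, --interval INTERVAL: summary interval, seconds	How often to print a summary, in seconds
-d DURATION, --duration DURATION: total duration of trace, seconds	How long to trace in total, in seconds
-r, --regexp: use regular expressions. Default is "*" wildcards	Use a regular expression to match a class of functions; by default only the "*" wildcard is available
./funccount c:malloc  # count all malloc() calls in libc	c: selects the libc library, not "C programs" in general
./funccount go:os.*  # count all "os.*" calls in libgo	go: selects the libgo library
-D, --debug: print BPF program before starting (for debugging purposes)	Print debugging information, i.e. the source of the generated BPF program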
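The two matching modes in the table (plain "*" wildcards by default, full regular expressions with -r) behave differently for the same pattern text. The sketch below imitates both modes over a list of made-up call records. The function names and data are hypothetical, and the translation of "*" into a regular expression is one reasonable reading of the help text rather than the tool's own code.

```python
import re
from collections import Counter
from typing import Iterable


def compile_pattern(pattern: str, use_regex: bool = False) -> re.Pattern:
    """Compile a function-name pattern in wildcard mode or regex mode."""
    if use_regex:
        return re.compile(pattern)
    # Wildcard mode: only "*" is special; every other character is literal,
    # and the pattern must match the whole name.
    escaped = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(f"^{escaped}$")


def count_calls(calls: Iterable[str], pattern: str, use_regex: bool = False) -> Counter:
    """Count how often each matching function name appears in `calls`."""
    rx = compile_pattern(pattern, use_regex)
    return Counter(name for name in calls if rx.search(name))


if __name__ == "__main__":
    # Made-up call log; each entry is the name of a called function.
    calls = ["vfs_read", "vfs_write", "vfs_read", "tcp_sendmsg", "vfs_fstat"]
    print(count_calls(calls, "vfs_*"))
    print(count_calls(calls, "^vfs_(read|write)$", use_regex=True))
```

Wildcard mode is easier to type safely in a shell, while regex mode is needed for alternatives such as "read or write".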

https://www.cnblogs.com/haoxing990/p/12178807.html

./runqlat -h
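Summarize run queue (scheduler) latency as a histogram	runqlat is very important for scheduler performance analysis: it measures how long a task waits on the run queue before it actually gets to run
positional arguments: interval (output interval, in seconds), count (number of outputs)	Output interval in seconds, and number of outputs
./runqlen -h
Summarize scheduler run queue length as a histogram	As the name suggests, runqlen summarizes the length of the run queue. In its output, the count column is the number of samples in which the run queue had that length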
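Both measurements can be modelled on recorded data. Run queue latency is the gap between a task becoming runnable and the moment it is switched onto a CPU; run queue length is sampled periodically, and each histogram count is the number of samples that saw a given length. The sketch below is a simplified model with hypothetical event names and made-up numbers, not the logic the tools run in the kernel.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def run_queue_latencies(events: Iterable[Tuple[int, str, int]]) -> List[Tuple[int, int]]:
    """Compute (pid, waited_us) from (ts_us, kind, pid) events.

    kind is "wakeup" (task became runnable) or "switch_in" (task got a CPU).
    """
    enqueued: Dict[int, int] = {}
    waits: List[Tuple[int, int]] = []
    for ts_us, kind, pid in events:
        if kind == "wakeup":
            enqueued.setdefault(pid, ts_us)  # keep the earliest wakeup
        elif kind == "switch_in" and pid in enqueued:
            waits.append((pid, ts_us - enqueued.pop(pid)))
    return waits


def queue_length_counts(sampled_lengths: Iterable[int]) -> Dict[int, int]:
    """Map each observed run queue length to the number of samples that saw it."""
    return dict(sorted(Counter(sampled_lengths).items()))


if __name__ == "__main__":
    events = [
        (100, "wakeup", 7),
        (130, "wakeup", 8),
        (150, "switch_in", 7),
        (400, "switch_in", 8),
    ]
    print(run_queue_latencies(events))
    print(queue_length_counts([0, 0, 1, 0, 2, 1]))
```

A long tail in the latency values, or many samples with a length above zero, both point to tasks waiting for CPU time.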

https://www.cnblogs.com/haoxing990/p/12203742.html

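As the name suggests, pidpersec reports the number of new PIDs per second. It has a single job: counting how many PIDs are created through fork each second. It has no help option, which reflects how narrow its function is.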

First, what is off-CPU?

https://www.cnblogs.com/haoxing990/p/12203997.html

On-CPU: where threads are spending time running on-CPU.

Off-CPU: where time is spent waiting while blocked on I/O, locks, timers, paging/swapping, etc.

From these definitions the purpose of offcputime is clear: it measures how long a process spends blocked.

./offcputime -h
Summarize off-CPU time by stack trace	Measures the time a process spends blocked, grouped by stack trace
positional arguments: duration (duration of trace, in seconds)	How long to trace, in seconds
-p PID, --pid PID: trace this PID only	Only trace the blocked time of this process
-t TID, --tid TID: trace this TID only	Only trace the blocked time of this thread
-u, --user-threads-only: user threads only (no kernel threads)	Only trace user threads, not kernel threads
-k, --kernel-threads-only: kernel threads only (no user threads)	Only trace kernel threads
-U, --user-stacks-only: show stacks from user space only (no kernel space stacks)	Show user-space call stacks only
-K, --kernel-stacks-only: show stacks from kernel space only (no user space stacks)	Show kernel call stacks only
-f, --folded: output folded format	Folded output: each stack is printed on a single line
-m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME: the amount of time in microseconds over which we store traces (default 1)	Only record blocking that lasts at least this many us
-M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME: the amount of time in microseconds under which we store traces (default U64_MAX)	Only record blocking that lasts at most this many us. With both set, only blocking with -m <= x <= -M is kept; the unit is us
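The combined effect of -m, -M, and -f can be sketched as: keep only blocking intervals inside the [min, max] window, add up the blocked time per stack, and print each stack on one line. The code below models this on made-up intervals. The function names and data are hypothetical, and it is a simplified model rather than the tool's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Stack = Tuple[str, ...]
U64_MAX = 2**64 - 1


def off_cpu_by_stack(intervals: Iterable[Tuple[Stack, int]],
                     min_us: int = 1,
                     max_us: int = U64_MAX) -> Dict[Stack, int]:
    """Sum blocked microseconds per stack, keeping min_us <= x <= max_us."""
    totals: Dict[Stack, int] = defaultdict(int)
    for stack, blocked_us in intervals:
        if min_us <= blocked_us <= max_us:
            totals[stack] += blocked_us
    return dict(totals)


def folded_lines(totals: Dict[Stack, int]) -> List[str]:
    """One line per stack, 'frame;frame total_us', longest total first."""
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{';'.join(stack)} {us}" for stack, us in ordered]


if __name__ == "__main__":
    # Made-up blocking intervals: (stack root-first, blocked microseconds).
    intervals = [
        (("main", "read"), 500),
        (("main", "lock"), 5),
        (("main", "read"), 1500),
        (("main", "sleep"), 10_000_000),
    ]
    totals = off_cpu_by_stack(intervals, min_us=10, max_us=1_000_000)
    print("\n".join(folded_lines(totals)))
```

Setting an upper bound is a practical way to hide threads that are simply idle for seconds at a time, so that shorter, more interesting waits stand out.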

pcstat (page cache stat): a command for page cache statistics

