Prerequisites
ebpf-host-routing
- kernel >= 5.10
This requirement exists because, starting with kernel 5.10, the two important helper functions bpf_redirect_peer() and bpf_redirect_neigh() are available; Cilium's eBPF Host-Routing feature is built on top of them.
eBPF Host-Routing
Quoting a passage from the official documentation:
We introduced eBPF-based host-routing in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve a faster network namespace switch compared to regular veth device operation. This option is automatically enabled if your kernel supports it. To validate whether your installation is running with eBPF host-routing, run
cilium status
in any of the Cilium pods and look for the line reporting the status for “Host Routing” which should state “BPF”.
In short: on a supported kernel, Cilium (since 1.9) automatically enables eBPF-based host routing, fully bypassing iptables and the upper host stack and switching network namespaces faster than regular veth device operation; when it is active, cilium status reports Host Routing as BPF.
Verify that Host Routing is BPF in this environment
root@master:~# kubectl -n kube-system exec -it cilium-fxdkr -- cilium status | grep -i "Host Routing"
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
Host Routing: BPF
bpf_redirect_peer()
As shown in the figure below, bpf_redirect_peer() removes one step from pod-to-pod communication on the same node: after a packet leaves the source pod, it can be redirected straight to the network interface inside the destination pod, without passing through that pod's corresponding LXC device on the host, so the forwarding path is one hop shorter.
For the implementation details, see the blog post from the official Cilium community; this redirect capability is what lets Cilium bypass the iptables overhead on the node.
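To make the helper concrete, here is a minimal, illustrative tc-BPF sketch in C. It is not Cilium's actual datapath code (that lives in the agent-managed bpf_lxc program); DST_LXC_IFINDEX is a hypothetical placeholder, whereas Cilium resolves the real ifindex of the destination endpoint from its eBPF maps.

// Minimal sketch only: attach at tc ingress of the *source* pod's host-side
// lxc device. Assumes libbpf headers and clang -target bpf.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Hypothetical ifindex of the destination pod's host-side lxc device;
// in Cilium this comes from the endpoint map, not a hard-coded constant.
#define DST_LXC_IFINDEX 22

SEC("tc")
int same_node_redirect(struct __sk_buff *skb)
{
    // bpf_redirect_peer() (kernel >= 5.10) delivers the skb to the *peer*
    // of DST_LXC_IFINDEX, i.e. eth0 inside the destination pod's netns,
    // skipping that pod's lxc device and the host's backlog queue.
    return bpf_redirect_peer(DST_LXC_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";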
cni-benchmark
Node network VS container network
As the figure shows, packet processing inside the pod and in the node network on the left is similar, but when a pod's packet leaves the current node it is processed once more on the host, i.e. it passes through the iptables overhead a second time.
Container Network VS Cilium eBPF Container network
Compared with the standard container network on the left, the Host-Routing feature of Cilium's eBPF container network skips everything circled in red in the left-hand figure and delivers the packet directly to the interface inside the container; this is exactly the capability provided by bpf_redirect_peer().
eBPF host-routing bypasses all of the iptables and upper-stack overhead in the host namespace, as well as the context-switching cost of crossing the veth pair. Incoming packets are picked up as early as possible on the network-facing device and delivered directly into the Kubernetes pod's network namespace. On egress, the packet still crosses the veth pair, is picked up by eBPF, and is handed straight to the external-facing network device (the host's eth0). The routing table is queried directly from eBPF, so the optimization is completely transparent and compatible with any other services running on the system that provide routing.
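The egress side described above relies on the companion helper bpf_redirect_neigh(). The sketch below is again only an illustration under the same assumptions (libbpf headers, clang -target bpf); HOST_ETH0_IFINDEX is a hypothetical placeholder for the host's external-facing device.

#include <stddef.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Hypothetical ifindex of the host's external-facing device (eth0).
#define HOST_ETH0_IFINDEX 2

SEC("tc")
int egress_to_wire(struct __sk_buff *skb)
{
    // bpf_redirect_neigh() (kernel >= 5.10) hands the skb to the external
    // device and lets the kernel's neighbour subsystem fill in the L2
    // header (nexthop MAC). Passing NULL/0 for the params means "resolve
    // the nexthop from the FIB lookup for this skb", so the packet skips
    // the host's upper stack and its iptables hooks.
    return bpf_redirect_neigh(HOST_ETH0_IFINDEX, NULL, 0, 0);
}

char _license[] SEC("license") = "GPL";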
Packet-capture verification of same-node communication
As shown in the figure, let's reason about what each capture point should see for the ICMP packets: the veth interface inside each pod, the host-side veth (lxc) interface paired with it, and the host NIC.
- POD1:
- eth0 inside pod1: receives both the request and the reply
- host-side lxc for pod1: receives the request, but not the reply
- POD2:
- eth0 inside pod2: receives both the request and the reply
- host-side lxc for pod2: receives the reply, but not the request
Pod placement
root@master:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cni-test-76d79dfb85-5l4w9 1/1 Running 0 3d6h 10.0.0.133 <none> <none>
cni-test-76d79dfb85-7thf8 1/1 Running 0 3d6h 10.0.0.67 <none> <none>
Source: 10.0.0.133 -> Destination: 10.0.0.67
Mapping pod1's interface to its host-side interface
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ethtool -S eth0
NIC statistics:
peer_ifindex: 20
...
# interface name for ifindex 20 on the host
root@node-1:~# ip link show | grep ^20
20: lxca2865f0fc4cc@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
Mapping pod2's interface to its host-side interface
root@master:~# kubectl exec -it cni-test-76d79dfb85-7thf8 -- ethtool -S eth0
NIC statistics:
peer_ifindex: 22
...
# interface name for ifindex 22 on the host
root@node-1:~# ip link show | grep ^22
22: lxcec93fc837a06@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 100
Capturing on pod1
Capture on pod1's eth0 (inside the pod):
kubectl exec -it cni-test-76d79dfb85-5l4w9 -- tcpdump -pne -i eth0 -w pod1_lxc.cap
Capture on the host lxc interface paired with pod1:
tcpdump -pne -i lxca2865f0fc4cc -w pod1_lxc_pair.cap
Capturing on pod2
Capture on pod2's eth0 (inside the pod):
kubectl exec -it cni-test-76d79dfb85-7thf8 -- tcpdump -pne -i eth0 -w pod2_lxc.cap
Capture on the host lxc interface paired with pod2:
tcpdump -pne -i lxcec93fc837a06 -w pod2_lxc_pair.cap
pod1 pings pod2
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
PING 10.0.0.67 (10.0.0.67): 56 data bytes
64 bytes from 10.0.0.67: seq=0 ttl=63 time=0.299 ms
--- 10.0.0.67 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
Inspect the captures:
Verified: eth0 inside pod1 receives both the request and the reply.
Verified: the lxc interface paired with pod1 receives the request, but not the reply.
Verified: eth0 inside pod2 receives both the request and the reply.
Verified: the lxc interface paired with pod2 receives the reply, but not the request.
Going further with pwru
Capturing with the pwru tool
pwru
Same flow as before: source 10.0.0.133 -> destination 10.0.0.67
pwru capture
Capture on the node where the pods run, filtering on the source address of each direction:
pwru --filter-src-ip 10.0.0.67 --output-tuple
2022/05/03 11:44:07 Attaching kprobes...
1421 / 1421 [----------------------------------------------------------------------------------------------------------------------------------------] 100.00% 28 p/s
pwru --filter-src-ip 10.0.0.133 --output-tuple
2022/05/03 11:43:46 Attaching kprobes...
1421 / 1421 [----------------------------------------------------------------------------------------------------------------------------------------] 100.00% 27 p/s
# ping test
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
PING 10.0.0.67 (10.0.0.67): 56 data bytes
64 bytes from 10.0.0.67: seq=0 ttl=63 time=0.507 ms
--- 10.0.0.67 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.507/0.507/0.507 ms
Captured output:
root@node-1:~# pwru --filter-src-ip 10.0.0.67 --output-tuple
2022/05/03 11:44:07 Attaching kprobes...
1421 / 1421 [----------------------------------------------------------------------------------------------------------------------------------------] 100.00% 28 p/s
2022/05/03 11:44:57 Attached (ignored 52)
2022/05/03 11:44:57 Listening for events..
SKB PROCESS FUNC TIMESTAMP
0xffff992e59159d00 [ping] ip_send_skb 32578980753945 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_local_out 32578980762591 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __ip_local_out 32578980765265 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_output 32578980767478 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] nf_hook_slow 32578980770495 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] apparmor_ipv4_postroute 32578980773557 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_finish_output 32578980775740 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __cgroup_bpf_run_filter_skb 32578980778080 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __ip_finish_output 32578980780955 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_finish_output2 32578980783494 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] neigh_resolve_output 32578980786897 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] __neigh_event_send 32578980789243 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] eth_header 32578980791631 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] skb_push 32578980793973 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] dev_queue_xmit 32578980796134 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] __dev_queue_xmit 32578980798539 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] netdev_core_pick_tx 32578980800876 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] validate_xmit_skb 32578980804273 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] netif_skb_features 32578980806440 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] passthru_features_check 32578980808938 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] skb_network_protocol 32578980811074 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] validate_xmit_xfrm 32578980813076 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] dev_hard_start_xmit 32578980815288 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] skb_clone_tx_timestamp 32578980818022 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] __dev_forward_skb 32578980820255 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] __dev_forward_skb2 32578980822768 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] skb_scrub_packet 32578980825038 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] eth_type_trans 32578980827462 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] netif_rx 32578980829989 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] netif_rx_internal 32578980831972 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] enqueue_to_backlog 32578980833574 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [<empty>] __netif_receive_skb 32578980847697 10.0.0.67:0->10.0.0.133:0(icmp)
root@node-1:~# pwru --filter-src-ip 10.0.0.133 --output-tuple
2022/05/03 11:43:46 Attaching kprobes...
1421 / 1421 [----------------------------------------------------------------------------------------------------------------------------------------] 100.00% 27 p/s
2022/05/03 11:44:39 Attached (ignored 52)
2022/05/03 11:44:39 Listening for events..
SKB PROCESS FUNC TIMESTAMP
0xffff992e59159900 [ping] ip_send_skb 32578980483886 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_local_out 32578980504988 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __ip_local_out 32578980513027 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_output 32578980516168 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] nf_hook_slow 32578980520228 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] apparmor_ipv4_postroute 32578980524494 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_finish_output 32578980528041 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __cgroup_bpf_run_filter_skb 32578980532607 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __ip_finish_output 32578980535884 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_finish_output2 32578980540962 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] neigh_resolve_output 32578980544866 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __neigh_event_send 32578980549401 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] eth_header 32578980552201 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_push 32578980554265 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] dev_queue_xmit 32578980558756 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_queue_xmit 32578980562637 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netdev_core_pick_tx 32578980564736 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] validate_xmit_skb 32578980569790 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_skb_features 32578980572386 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] passthru_features_check 32578980575187 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_network_protocol 32578980577416 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] validate_xmit_xfrm 32578980580645 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] dev_hard_start_xmit 32578980583435 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_clone_tx_timestamp 32578980586755 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_forward_skb 32578980589508 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_forward_skb2 32578980592276 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_scrub_packet 32578980596446 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] eth_type_trans 32578980600345 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_rx 32578980602431 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_rx_internal 32578980604783 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] enqueue_to_backlog 32578980607385 10.0.0.133:0->10.0.0.67:0(icmp)
2022/05/03 11:46:05 Perf event ring buffer full, dropped 3 samples
0xffff992e59159900 [ping] skb_ensure_writable 32578980650033 10.0.0.133:0->10.0.0.67:0(icmp)
Capturing with cilium monitor
Start the monitor, then trigger a ping:
kubectl -n kube-system exec -it cilium-qqf8s -- cilium monitor --debug -vv
kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
Look up the endpoint IDs of the two pods (in the trace below, 183 is the source pod's endpoint and 296 the destination's).
Captured output:
Key parts:
# 1. Processing starts here: source 10.0.0.133 --> 10.0.0.67; the packet leaving the source pod goes through Cilium's conntrack lookup
CPU 02 MARK 0x0 FROM 183 DEBUG Conntrack lookup 1/2 src=10.0.0.133:26368 dst=10.0.0.67:0
CPU 02 MARK 0x0 FROM 183 DEBUG Conntrack lookup 2/2 nexthdr=1 flags=1
# 2. In Cilium the destination address is mapped to identity 3352. The security identity is a Cilium concept: when a pod restarts and its IP changes, the identity, and therefore the policy, stays the same
CPU 02 MARK 0x0 FROM 183 DEBUG Successfully mapped addr=10.0.0.67 to identity=3352
CPU 02 MARK 0x0 FROM 183 DEBUG Conntrack create proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
# 3. Note that the endpoint ID changes here to 296, i.e. the other pod
CPU 02 MARK 0x0 FROM 183 DEBUG Attempting local delivery for container id 296 from seclabel 3352
# 4. Our packet has reached the destination pod, which received the ICMP request from the source pod; again a Cilium conntrack lookup is performed
CPU 02 MARK 0x0 FROM 296 DEBUG Conntrack lookup 1/2 src=10.0.0.133:26368 dst=10.0.0.67:0
CPU 02 MARK 0x0 FROM 296 DEBUG Conntrack lookup 2/2 nexthdr=1 flags=0
-----------------------------------------------------------------------------------------
# 5. The reply from the destination pod goes through the conntrack lookup
CPU 02 MARK 0x0 FROM 296 DEBUG Conntrack lookup 1/2 src=10.0.0.67:0 dst=10.0.0.133:26368
CPU 02 MARK 0x0 FROM 296 DEBUG Conntrack lookup 2/2 nexthdr=1 flags=1
CPU 02 MARK 0x0 FROM 296 DEBUG CT entry found lifetime=16809827, revnat=0
CPU 02 MARK 0x0 FROM 296 DEBUG CT verdict Reply, revnat=0
# 6. At this point the packet is redirected by bpf_redirect_peer() straight to the eth0 interface of the destination pod on the same node (here, back into pod1)
CPU 02 MARK 0x0 FROM 296 DEBUG Successfully mapped addr=10.0.0.133 to identity=3352
CPU 02 MARK 0x0 FROM 296 DEBUG Attempting local delivery for container id 183 from seclabel 3352
# 7. The packet arrives back at the source pod; on ingress into the pod it again goes through a conntrack lookup
CPU 02 MARK 0x0 FROM 183 DEBUG Conntrack lookup 1/2 src=10.0.0.67:0 dst=10.0.0.133:26368
CPU 02 MARK 0x0 FROM 183 DEBUG Conntrack lookup 2/2 nexthdr=1 flags=0
Summary
With eBPF Host-Routing enabled, the kernel's bpf_redirect_peer() helper lets traffic between pods on the same node be redirected directly into the peer pod, bypassing the iptables overhead. This removes redundant processing steps from the datapath and improves network latency.