The text and images in this article are my own work; please notify the author before quoting or reposting.
The Kubernetes flannel network model is not complicated, but tuning it requires learning quite a few networking details. This document is an introduction to basic tuning and covers three parts:
- the kernel packet receive path
- an introduction to the flannel network
- network testing and optimization
The receive path itself is only sketched briefly; the focus is on how to monitor each layer of the stack and how to tune it.
Kernel packet receive path
The figure below shows how a packet travels from the NIC to user space inside the kernel. The rough flow is:
- The NIC driver is loaded and initialized
- A packet arrives at the NIC
- The packet is copied into memory via DMA (Direct Memory Access), and its descriptor is added to the RX ring buffer
- The NIC raises a hardware interrupt to notify the system that the packet has reached memory
- The driver uses NAPI (New API) to start a poll loop that pulls packets into the kernel
- At boot, every CPU core registers a ksoftirqd thread; ksoftirqd calls the NAPI poll method to drain packets from the ring buffer
- Each packet is placed into an skb structure and enters the network protocol stack
- If packet steering (RPS/RFS) is enabled, packets are distributed across CPUs for processing
- Packets are taken from the per-CPU queues and handled by the protocol stack
- After protocol-stack processing, each packet is appended to the receive buffer of the socket it belongs to
From a tuning point of view, however, only three stages matter:
- NIC to driver: focus on the NIC itself and hardware-interrupt handling
- driver to protocol stack: focus on the NAPI path
- protocol stack to application: focus on protocol-stack tunables and socket buffer sizes
The monitoring methods and tunables for each stage are marked in the figure above; for details, refer to the Red Hat Enterprise Linux Network Performance Tuning Guide.
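As a concrete starting point, the counters for each stage can be read with standard tools. The following is a minimal sketch, assuming the physical interface is named eno1 (as in the flannel section below); the exact statistics exposed by ethtool depend on the driver.
# NIC to driver: ring buffer sizes and driver-level drop/error counters.
ethtool -g eno1
ethtool -S eno1 | grep -iE 'drop|err|fifo'

# Driver to protocol stack: per-CPU softirq backlog. The 2nd column counts packets
# dropped because netdev_max_backlog was exceeded; the 3rd column counts "time squeeze"
# events where the netdev_budget ran out before the queue was drained.
cat /proc/net/softnet_stat

# Protocol stack to socket: protocol counters and per-socket receive-buffer usage
# (use -u instead of -t for UDP sockets).
netstat -s | grep -iE 'buffer|prune|collapse'
ss -ntmp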
Introduction to the flannel network
The figure below shows the flannel network and how packets are decapsulated. Flannel is one of the mainstream Kubernetes network models. When Kubernetes builds an overlay network with flannel, the following components are involved:
- a daemon, flanneld
- flannel.1, a tun network device that receives vxlan packets
- cni0, an internal bridge that connects to the pods
When Kubernetes creates a pod, flannel creates a veth pair: one end lives in the pod's network namespace (visible inside the pod as eth0), and the other end is attached to cni0.
As the figure shows, decapsulating a vxlan packet takes roughly five steps:
- The physical NIC eno1 receives the packet, and the outer MAC and IP headers are processed
- The kernel IP stack delivers the packet to flanneld listening on UDP port :8285
- flanneld processes the received UDP packet, parsing the UDP header and the vxlan header, and then, based on the vxlan id, forwards the packet to the corresponding tun device, here flannel.1
- Next, by looking up the ip route table, the packet is forwarded through flannel.1 and cni0 into eth0 inside the pod's network namespace
- Finally, inside the pod, the inner MAC, IP, and UDP headers are parsed and the payload is delivered to the socket
Packet flow between user space and kernel space
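The components above can be inspected directly on a worker node. A minimal sketch, assuming the device names used in this article (flannel.1, cni0); run it on any node that hosts pods.
# vxlan device created by flanneld (type, vxlan id, local VTEP address).
ip -d link show flannel.1

# Bridge that the pod-side veth endpoints are attached to.
ip addr show cni0
bridge link show | grep cni0

# Routes that steer pod-subnet traffic into flannel.1 or cni0.
ip route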
Network testing
Test plan:
1. Bandwidth test
Use iperf to measure the maximum bandwidth: run one iperf server and an iperf client on each worker node, then measure the bandwidth between every node and the server. The layout is shown below.
Test manifest: iperf.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-server-deployment
  labels:
    app: iperf-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-server
  template:
    metadata:
      labels:
        app: iperf-server
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: kubernetes.io/role
                operator: In
                values:
                - master
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: iperf-server
        image: iperf-arm64
        args: ['-s']
        ports:
        - containerPort: 5001
          name: server
      terminationGracePeriodSeconds: 0
---
apiVersion: v1
kind: Service
metadata:
  name: iperf-server
spec:
  selector:
    app: iperf-server
  ports:
  - protocol: TCP
    port: 5001
    targetPort: server
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: iperf-clients
  labels:
    app: iperf-client
spec:
  selector:
    matchLabels:
      app: iperf-client
  template:
    metadata:
      labels:
        app: iperf-client
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: iperf-client
        image: iperf-arm64
        command: ['/bin/sh', '-c', 'sleep infinity']
      terminationGracePeriodSeconds: 0
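Assuming the manifest is saved as iperf.yaml and the iperf-arm64 image is reachable from the cluster, deploying it is a single apply; the DaemonSet places one client pod on every node.
kubectl apply -f iperf.yaml
kubectl get pods -o wide -l app=iperf-client   # one iperf client per node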
bandwidth.sh
#!/usr/bin/env bash
set -eu

# Pod names of every iperf client (one per node, from the DaemonSet) and of the server.
CLIENTS=$(kubectl get pods -l app=iperf-client -o name | cut -d'/' -f2)
SERVER=$(kubectl get pods -l app=iperf-server -o name | cut -d'/' -f2)

for POD in ${CLIENTS}; do
    # Wait until the client pod reports ready.
    until [ "$(kubectl get pod ${POD} -o jsonpath='{.status.containerStatuses[0].ready}')" = "true" ]; do
        echo "Waiting for ${POD} to start..."
        sleep 5
    done
    # Server pod IP (informational; the client connects through the iperf-server Service name).
    HOSTIP=$(kubectl describe pods iperf-server | grep IP | head -1 | awk '{print $2}')
    # Three parallel streams from this client; extra script arguments are passed straight to iperf.
    kubectl exec -it ${POD} -- iperf -c iperf-server -P 3 "$@"
    kubectl logs ${SERVER}
    sleep 10
done
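Any extra arguments are forwarded to iperf, so a longer measurement per client might look like this (the 30-second duration is only an example):
./bandwidth.sh -t 30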
Bandwidth test optimization
Tunables (a sketch of both follows the list):
- enable RPS/RFS to relieve CPU bottlenecks
- adjust the MTU to match the network
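A minimal sketch of both tunables; the interface name (eno1), queue (rx-0), CPU mask, flow-table sizes, and MTU value are assumptions that have to be adapted to the actual hardware and underlay network.
# RPS: steer packets from RX queue 0 onto CPUs 0-3 (bitmask f).
echo f > /sys/class/net/eno1/queues/rx-0/rps_cpus

# RFS: give the kernel a global flow table and assign a share of it to the queue.
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eno1/queues/rx-0/rps_flow_cnt

# MTU: raise it only if the underlay supports jumbo frames; the overlay device
# (flannel.1) needs roughly 50 bytes of headroom for the vxlan encapsulation.
ip link set dev eno1 mtu 9000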
2. Throughput test
Use iperf3 to measure the PPS (packets per second) of UDP traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf3-server-deployment
  labels:
    app: iperf3-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf3-server
  template:
    metadata:
      labels:
        app: iperf3-server
    spec:
      nodeName: NODE1
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: kubernetes.io/role
                operator: In
                values:
                - master
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: iperf3-server
        image: iperf3-arm64
        args: ['-s']
        ports:
        - containerPort: 5201
          name: server
      terminationGracePeriodSeconds: 0
---
apiVersion: v1
kind: Service
metadata:
  name: iperf3-server
spec:
  selector:
    app: iperf3-server
  ports:
  - protocol: TCP
    port: 5201
    targetPort: server
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: iperf3-clients
  labels:
    app: iperf3-client
spec:
  selector:
    matchLabels:
      app: iperf3-client
  template:
    metadata:
      labels:
        app: iperf3-client
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: iperf3-client
        image: iperf3-arm64
        command: ['/bin/sh', '-c', 'sleep infinity']
      terminationGracePeriodSeconds: 0
pps.sh
#!/bin/bash
# Usage: ./pps.sh <node-hostname> -- run the UDP PPS test from the iperf3 client on that node.
# IP of the iperf3 server pod.
TESTIP=$(kubectl describe pods iperf3-server | grep IP | head -1 | awk '{print $2}')
# Name of the iperf3 client pod scheduled on the given node.
TESTCLIENT=$(kubectl describe pods iperf3-clients | grep -B 3 "$1" | grep "Name:" | awk '{print $2}')
# 46-byte UDP payloads at up to 1 Gbit/s for 20 seconds, stressing packets per second rather than bandwidth.
kubectl exec "$TESTCLIENT" -- iperf3 -c "$TESTIP" -u -l 46 -b 1G -t 20
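A run against the client on a node called node2 (a hypothetical hostname) might look like this; watching /proc/net/softnet_stat on the server node during the run shows whether the small-packet load is being dropped in the backlog.
./pps.sh node2
watch -n 1 cat /proc/net/softnet_stat   # run on the server node during the test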
Optimization
NIC to driver
- Adjusting the size of the RX/TX queues
- Hardware Interrupt coalescing
- IRQ affinity
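A sketch of the three knobs, assuming the physical interface is eno1; supported ring sizes and coalescing parameters vary by NIC and driver, and <irq-number> is a placeholder taken from /proc/interrupts.
# RX/TX queues: inspect the current/maximum ring sizes, then enlarge them.
ethtool -g eno1
ethtool -G eno1 rx 4096 tx 4096

# Interrupt coalescing: raise an IRQ after 50 us or 64 frames, whichever comes first.
ethtool -C eno1 rx-usecs 50 rx-frames 64

# IRQ affinity: find the NIC's IRQ number and pin it to CPU 2 (mask 0x4).
grep eno1 /proc/interrupts
echo 4 > /proc/irq/<irq-number>/smp_affinity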
Driver to protocol stack
- net.core.netdev_budget
- net.core.netdev_budget_usecs
- net.core.netdev_max_backlog
- RPS/RFS
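Illustrative sysctl settings for this stage; the numbers are examples, not recommendations, and should be validated against the drop and time-squeeze counters in /proc/net/softnet_stat.
# Let each NAPI poll round process more packets and run longer before yielding.
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=4000

# Deeper per-CPU backlog queue (used when RPS hands packets to another CPU).
sysctl -w net.core.netdev_max_backlog=5000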
Protocol stack to socket
- net.core.rmem_max
- net.core.rmem_default
- net.ipv4.udp_mem
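Illustrative values for the socket-side buffers; in practice they should be sized against the bandwidth-delay product and checked against the UDP receive-error counters in netstat -su.
# Ceiling and default for per-socket receive buffers, in bytes.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=1048576

# min / pressure / max memory (in pages) shared by all UDP sockets.
sysctl -w net.ipv4.udp_mem="262144 327680 393216"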