How much memory is my container in K8s actually using?

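Introduction:
On Linux, developers are used to running top and free on physical or virtual machines to check how much memory the machine and its processes use. In recent years more and more services have been rebuilt as containerized microservices, and the old ways of viewing, monitoring, and troubleshooting memory usage often stop working. If your application has just moved to K8s, you have probably been puzzled by questions like these: Why is the container's memory utilization always close to 99%? Every malloc has a matching free, so why does memory usage keep rising? Memory usage exceeds the limit, so why was the container not OOM-killed? Why does the output of top and free inside the container not match the platform's monitoring view at all? This article assumes the reader is familiar with Linux and has experience with common backend languages (C/C++, Go, Java, and so on). We hope it offers some useful leads when you face these questions.

The main source of monitoring data in K8s is cAdvisor. The container memory metrics covered in this article are container_memory_rss, container_memory_cache, container_memory_mapped_file, and container_memory_working_set_bytes.

What do these metrics actually mean, and which ones matter in which scenario? Let's start by reviewing the Linux process address space and work our way toward how container memory usage is accounted.

1 How does a process allocate memory?

Recall the layout of a Linux process's virtual address space.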

+-------------------------------+ 0xFFFFFFFFFFFFFFFF (64-bit)
|       Kernel Space            |  Kernel space: the OS kernel and kernel modules
+-------------------------------+ 0x00007FFFFFFFFFFF (top of user space)
|       Stack                   |  Local variables, arguments, and return addresses of function calls
|                               |  The stack usually grows from high addresses toward low addresses
+-------------------------------+
|       Memory Mapping Region   |  mmap regions, including dynamically loaded
|       (Shared Libraries)      |  shared libraries (.so files)
+-------------------------------+
|       Heap                    |  Dynamic allocation area (malloc, calloc, realloc)
|       (malloc, etc.)          |  The heap can grow or shrink at run time
+-------------------------------+
|       BSS Segment             |  Uninitialized global and static variables
|       (Uninitialized Data)    |  Zero-filled when the program starts
+-------------------------------+
|       Data Segment            |  Initialized global and static variables
|       (Initialized Data)      |  Set to their specified values when the program starts
+-------------------------------+
|       Text Segment            |  Executable code
|       (Code)                  |  Usually read-only, to prevent accidental modification
+-------------------------------+ 0x0000000000000000

In the Linux kernel, the structure that describes the layout above is mm_struct, which can be expanded in more detail:

+-------------------------------+
| task_struct (/bin/gonzo)      |
|                               |  
|   mm                          |  
|   |                           |  
|   v                           |  
| +---------------------------+ |  
| | mm_struct                 | |  
| |                           | |  
| |   mmap                    | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_EXEC     | | |  
| | |-----------------------| | |  
| | | Text (file-backed)    | | |  
| | +-----------------------+ | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_WRITE    | | |  
| | |-----------------------| | |  
| | | Data (file-backed)    | | |  
| | +-----------------------+ | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_WRITE    | | |  
| | |-----------------------| | |  
| | | BSS (anonymous)       | | |  
| | +-----------------------+ | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_WRITE    | | |  
| | |-----------------------| | |  
| | | Heap (anonymous)      | | |  
| | +-----------------------+ | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_EXEC     | | |  
| | |-----------------------| | |  
| | | Memory mapping        | | |  
| | +-----------------------+ | |  
| |   |                       | |  
| |   v                       | |  
| | +-----------------------+ | |  
| | | vm_area_struct        | | |  
| | | VM_READ | VM_WRITE    | | |  
| | | VM_GROWS_DOWN         | | |  
| | |-----------------------| | |  
| | | Stack (anonymous)     | | |  
| | +-----------------------+ | |  
| +---------------------------+ |
+-------------------------------+

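As you can see, a Linux process address space is made up of vm_area_struct (vma) entries, each covering its own address range. If your code panics or crashes with a Segmentation Fault, the most direct cause is usually that the pointer you dereferenced does not fall inside any of the process's vmas (or that the access violates the vma's permissions). You can inspect a process's vmas through /proc/<pid>/maps.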

1.1 Allocating memory with malloc

malloc grows the heap in the process's virtual address space: it extends the start/end range of an existing vma in the mm descriptor, or inserts a new vma. Right after the call returns, however, the process's actual memory usage has not increased.

The following program demonstrates this.

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/resource.h>
#include <stdio.h>
#include <time.h>

const int64_t GB = 1024 * 1024 * 1024;
const int64_t MB = 1024 * 1024;
const int64_t KB = 1024;

/* Print the peak RSS and the page fault counters of this process. */
void max_rss() {
    struct rusage r_usage;
    getrusage(RUSAGE_SELF, &r_usage);
    printf("Current max rss %ld kb, pagefault minor %ld, major %ld\n",
        r_usage.ru_maxrss, r_usage.ru_minflt, r_usage.ru_majflt);
}

int main() {
    printf("Pid %d\n", (int)getpid());
    int number = 128;
    /* Step 1: allocate only; no physical pages are touched yet. */
    void *ptr = malloc(number * MB);
    if (ptr == NULL) {
        printf("Out of memory\n");
        exit(EXIT_FAILURE);
    }
    printf("Allocated %d MB memory by malloc(3), ptr %p\n", number, ptr);
    max_rss();
    sleep(60);
    /* Step 2: write to every page, which triggers the page faults. */
    memset(ptr, 0, number * MB);
    printf("Used %d MB memory by memset(3)\n", number);
    max_rss();
    sleep(60);
    /* Step 3: release the memory. */
    free(ptr);
    printf("Memory ptr %p freed by free(3)\n", ptr);
    max_rss();
    sleep(60);
    return 0;
}

Output:

Pid 932451
Allocated 128 MB memory by malloc(3), ptr 0x7f3e6cdff010
Current max rss 3800 kb, pagefault minor 122, major 0
Used 128 MB memory by memset(3)
Current max rss 132732 kb, pagefault minor 187, major 0
Memory ptr 0x7f3e6cdff010 freed by free(3)
Current max rss 132732 kb, pagefault minor 187, major 0

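Summary 1

With 4 KiB pages (the default Linux page size), touching 128 MiB takes 32768 minor page faults, since 32768 * 4096 is exactly 128 MiB. The run captured above reports only 65 additional minor faults (187 - 122), which is consistent with the kernel backing most of the region with 2 MiB transparent huge pages (128 MiB / 2 MiB = 64). Either way, RSS rises to about 129 MiB only after memset, not after malloc.
Because ru_maxrss is a peak value, it does not drop after free. You can confirm the same behavior with top while the program sleeps.

Put differently, an address returned by malloc has to go through a page fault, which establishes the virtual-to-physical mapping, before it is really used. Only virtual address space that has been backed by physical pages counts toward memory usage.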

2 container_memory_rss

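2.1 RSS of a process

A process's RSS (Resident Set Size) is the amount of physical memory it currently occupies, including the memory used by the code segment, heap, stack, shared libraries, and so on. In effect it is the total size of the physical pages mapped in its page tables.

More precisely, according to the kernel's get_mm_rss, RSS consists of FilePages, AnonPages, and ShmemPages.

The following example shows how each of the three kinds of memory is allocated and used. FilePages, AnonPages, and ShmemPages are 4 MiB, 8 MiB, and 10 MiB respectively, 22 MiB in total.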

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define FILE_SIZE (4 * 1024 * 1024) // 4 MiB
#define ANON_SIZE (8 * 1024 * 1024) // 8 MiB
#define SHM_SIZE (10 * 1024 * 1024) // 10 MiB

void allocate_filepages() {
    int fd = open("tempfile", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    if (ftruncate(fd, FILE_SIZE) == -1) {
        perror("ftruncate");
        close(fd);
        exit(EXIT_FAILURE);
    }

    void *file_mem = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (file_mem == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(EXIT_FAILURE);
    }

    memset(file_mem, 0, FILE_SIZE); // Touch the memory

    printf("Allocated %d MiB of file-mapped memory\n", FILE_SIZE / (1024 * 1024));

    // Keep the mapping until the program exits
    // munmap(file_mem, FILE_SIZE);
    // close(fd);
    // unlink("tempfile");
}

void allocate_anonpages() {
    void *anon_mem = malloc(ANON_SIZE);
    if (anon_mem == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    memset(anon_mem, 0, ANON_SIZE); // Touch the memory

    printf("Allocated %d MiB of anonymous memory\n", ANON_SIZE / (1024 * 1024));
    // free(anon_mem);
}

void allocate_shmempages() {
    int shmid = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT | 0600);
    if (shmid == -1) {
        perror("shmget");
        exit(EXIT_FAILURE);
    }

    void *shm_mem = shmat(shmid, NULL, 0);
    if (shm_mem == (void *)-1) {
        perror("shmat");
        shmctl(shmid, IPC_RMID, NULL);
        exit(EXIT_FAILURE);
    }

    memset(shm_mem, 0, SHM_SIZE); // Touch the memory
    printf("Allocated %d MiB of shared memory\n", SHM_SIZE / (1024 * 1024));

    // Keep the mapping until the program exits
    // shmdt(shm_mem);
    // shmctl(shmid, IPC_RMID, NULL);
}

int main() {
    printf("Process %d\n", getpid());

    allocate_filepages();
    allocate_anonpages();
    allocate_shmempages();

    sleep(3600);

    return 0;
}

Output of top -p $pid:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
3881259 root      20   0   28540  24184  15872 S   0.0   0.1   0:00.01 a.out

top shows that the process's RSS is 24184 KiB, which is 1656 KiB more than the 22 MiB (22528 KiB) we allocated.

Looking further at /proc/$pid/status:

....  
VmRSS:       24184 kB  
RssAnon:            8312 kB  
RssFile:            5632 kB  
RssShmem:          10240 kB  
VmData:     8436 kB  
VmStk:       132 kB  
VmExe:         4 kB  
VmLib:      1576 kB  
VmPTE:       100 kB  
VmSwap:        0 kB  
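....

VmRSS matches the RES value shown by top exactly. RssAnon is 120 KiB more than 8192 KiB because it also includes the stack. RssFile is 1536 KiB more than 4096 KiB because it also includes shared libraries. Note that the kernel's mm_struct counters are not always perfectly timely or precise.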

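Summary 2

2.2 RSS of a container (memcg)

In a K8s container environment, all processes in a container belong to the same cgroup; this article only looks at the memory cgroup (memcg). Build the program above into a container image, deploy it in a TKEx environment, and look at the container's memory metrics.

container_memory_rss turns out to be only 2047 * 4096 bytes, slightly less than 8 MiB and far below the 24 MiB that top reported in the previous section. Why?

In 2.1, from /proc/$pid/status and top, we derived a way to estimate a process's RSS: 1) the dominant part, which is anonymous pages from malloc (brk or anonymous mmap) + shmem shared memory in use + mmap file mappings; 2) the stack, text, and shared-library parts, which are usually small.

So is a memory cgroup's rss simply the sum of the RSS of all processes in the memcg? Clearly not. Tracing the cAdvisor code shows that the value comes from the rss field of the memory.stat file under the container's cgroup path.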

2.2.1 How do we find the memcg path of a container?

Each container's memory cgroup path is determined by its QoS class and unique identifiers. The basic formats are:

Burstable:

/sys/fs/cgroup/memory/kubepods/burstable/pod<uid>/<container-id>

BestEffort:

/sys/fs/cgroup/memory/kubepods/besteffort/pod<uid>/<container-id>

Guaranteed:

/sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>

You can confirm the Pod's QoS class from the status section (status.qosClass) of the Pod YAML.

Once you have the memcg path, you will find many files in the directory. Here we focus on memory.stat:

root@memory-0:~# ls /sys/fs/cgroup/memory/kubepods/burstable/pod2d08e58b-50f7-41fa-bd42-946402c34646/b366c08f2ecedd6acdb38e4ec24913aea0ca3babeed297abbcfafafa4e8027de  
cgroup.clone_children         memory.bind_blkio               memory.kmem.tcp.max_usage_in_bytes  memory.memsw.max_usage_in_bytes  memory.pressure              memory.usage_in_bytes  
cgroup.event_control          memory.failcnt                  memory.kmem.tcp.usage_in_bytes      memory.memsw.usage_in_bytes      memory.pressure_level        memory.use_hierarchy  
cgroup.priority               memory.force_empty              memory.kmem.usage_in_bytes          memory.move_charge_at_immigrate  memory.priority_wmark_ratio  memory.use_priority_oom  
cgroup.procs                  memory.kmem.failcnt             memory.limit_in_bytes               memory.numa_stat                 memory.sli                   memory.vmstat  
memory.alloc_bps              memory.kmem.limit_in_bytes      memory.max_usage_in_bytes           memory.oom.group                 memory.sli_max               notify_on_release  
memory.async_distance_factor  memory.kmem.max_usage_in_bytes  memory.meminfo                      memory.oom_control               memory.soft_limit_in_bytes   tasks  
memory.async_high             memory.kmem.slabinfo            memory.meminfo_recursive            memory.pagecache.current         memory.stat  
memory.async_low              memory.kmem.tcp.failcnt         memory.memsw.failcnt                memory.pagecache.max_ratio       memory.swappiness  
memory.async_ratio            memory.kmem.tcp.limit_in_bytes  memory.memsw.limit_in_bytes         memory.pagecache.reclaim_ratio   memory.sync

2.2.2 How is rss in memory.stat calculated?

Tracing the Linux memory cgroup (memcg from here on) source code, memcg tracks the following kinds of memory usage:

static const unsigned int memcg1_stats[] = {  
  MEMCG_CACHE,  
  MEMCG_RSS,  
  MEMCG_RSS_HUGE,  
  NR_SHMEM,  
  NR_FILE_MAPPED,  
  NR_FILE_DIRTY,  
  NR_WRITEBACK,  
  MEMCG_SWAP,  
};

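Following how MEMCG_RSS is updated shows that only anonymous pages are counted in MEMCG_RSS, unlike the process RSS we looked at earlier. Shared-memory pages are counted only in MEMCG_CACHE, even when they sit on the anonymous LRU.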

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,  
           struct page *page,  
           bool compound, int nr_pages)  
{  
  /*  
   * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is  
   * counted as CACHE even if it's on ANON LRU.  
   */  
  if (PageAnon(page))  
    __mod_memcg_state(memcg, MEMCG_RSS, nr_pages);  
  else {  
    __mod_memcg_state(memcg, MEMCG_CACHE, nr_pages);  
    if (PageSwapBacked(page))  
      __mod_memcg_state(memcg, NR_SHMEM, nr_pages);  
  }  
  ....  
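}

Earlier we saw container_memory_cache at close to 14 MiB, which covers the shmem and the mmap file mapping. The conclusion is that the memory cgroup's rss only counts the memory allocated with malloc in the program above and does not include the other two parts.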

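Summary 3

3 container_memory_cache

3.1 A first look at the page cache

The page cache is the mechanism the kernel uses to cache file system data. By keeping file data in memory it reduces disk I/O and speeds up file reads. When an application reads a file, the kernel first checks the page cache; if the data is already there, it is read straight from memory and the disk is not touched.

The following small C program shows how reading and writing files produces page cache. It writes 100 MiB of data to the given text file.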

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFFER_SIZE 4096
#define FILE_SIZE_MB 100

void generate_page_cache(const char *filename) {
    int fd;
    char buffer[BUFFER_SIZE];
    ssize_t bytes_written;
    size_t total_bytes_written = 0;

    // Initialize the buffer
    for (int i = 0; i < BUFFER_SIZE; i++) {
        buffer[i] = 'A' + (i % 26); // Fill the buffer with some data
    }

    // Open the file for writing
    fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    // Write until the file size reaches FILE_SIZE_MB
    while (total_bytes_written < FILE_SIZE_MB * 1024 * 1024) {
        bytes_written = write(fd, buffer, BUFFER_SIZE);
        if (bytes_written == -1) {
            perror("write");
            close(fd);
            exit(EXIT_FAILURE);
        }
        total_bytes_written += bytes_written;
    }

    // Close the file
    close(fd);
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    generate_page_cache(argv[1]);

    printf("Page cache generated for file: %s\n", argv[1]);

    return 0;
}

Before running the program, drop the caches to clear the page cache already in the system:

# sync && echo 3 > /proc/sys/vm/drop_caches

Then record the system's page cache figures at this point.

# free -m  
               total        used        free      shared  buff/cache   available  
Mem:           32096        2470       29742         872        1152       29626  
Swap:              0           0           0  
# cat /proc/meminfo  
...  
Buffers:            4760 kB  
Cached:          1096448 kB  
SwapCached:            0 kB  
Active:           766032 kB  
Inactive:        1263964 kB  
Active(anon):     590144 kB  
Inactive(anon):  1231776 kB  
Active(file):     175888 kB  
Inactive(file):    32188 kB

Compile and run the program, then check the page cache figures again.

# ./a.out cache.txt   
Page cache generated for file: cache.txt  
# free -m  
               total        used        free      shared  buff/cache   available  
Mem:           32096        2469       29640         872        1256       29627  
Swap:              0           0           0

# cat /proc/meminfo   
Buffers:            5116 kB  
Cached:          1199444 kB  
SwapCached:            0 kB  
Active:           766652 kB  
Inactive:        1366800 kB  
Active(anon):     590216 kB  
Inactive(anon):  1231776 kB  
Active(file):     176436 kB  
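Inactive(file):   135024 kB

Cached in /proc/meminfo grew by 102996 KiB, about 100.6 MiB, and buff/cache in free -m grew by 104 MiB. Both are roughly the size of the file we wrote. The small differences come from other processes on the system that also affect the page cache.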

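3.2 Active File and Inactive File

Looking closely at the /proc/meminfo output above, the extra 100 MiB of page cache shows up entirely under Inactive(file), while Active(file) barely changes.

In fact, page cache produced by the first read or write of a file is always inactive; only when a page is read or written again is it moved to the active LRU list. Linux uses two LRU lists to manage active and inactive page cache separately. When the system runs short of memory, page cache on the inactive LRU is reclaimed first. In many cases file content is read only once, log files for example, and the page cache it occupies should be the first to go.

Now let's test another small program: it creates a file and writes 100 MiB of data, then reads the file twice in a row. We compare /proc/meminfo before and after.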

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#define FILE_SIZE (100 * 1024 * 1024) // 100 MiB

void read_file(const char *filename) {
    int fd = open(filename, O_RDONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char *buffer = malloc(FILE_SIZE);
    if (buffer == NULL) {
        perror("malloc");
        close(fd);
        exit(EXIT_FAILURE);
    }

    ssize_t bytes_read = read(fd, buffer, FILE_SIZE);
    if (bytes_read == -1) {
        perror("read");
        free(buffer);
        close(fd);
        exit(EXIT_FAILURE);
    }

    printf("Read %zd bytes from file\n", bytes_read);

    free(buffer);
    close(fd);
}

int main() {
    const char *filename = "testfile";

    // Create a test file
    int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    if (ftruncate(fd, FILE_SIZE) == -1) {
        perror("ftruncate");
        close(fd);
        exit(EXIT_FAILURE);
    }

    char *file_mem = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (file_mem == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(EXIT_FAILURE);
    }

    memset(file_mem, 'A', FILE_SIZE); // Initialize the file content
    munmap(file_mem, FILE_SIZE);
    close(fd);

    // Read the file for the first time
    read_file(filename);

    // Read the file for the second time
    read_file(filename);

    return 0;
}

Drop the caches before the test and record the data.

# cat /proc/meminfo
Buffers:            4000 kB  
Cached:          1108280 kB  
SwapCached:            0 kB  
Active:           778248 kB  
Inactive:        1274056 kB  
Active(anon):     599416 kB  
Inactive(anon):  1241900 kB  
Active(file):     178832 kB  
Inactive(file):    32156 kB

After the test, record the data again.

# ./a.out  
Read 104857600 bytes from file  
Read 104857600 bytes from file  
# cat /proc/meminfo   
Buffers:            6340 kB  
Cached:          1215868 kB  
SwapCached:            0 kB  
Active:           884284 kB  
Inactive:        1277620 kB  
Active(anon):     599088 kB  
Inactive(anon):  1241900 kB  
Active(file):     285196 kB  
Inactive(file):    35720 kB

Active(file) has now grown by about 104 MiB (106364 KiB), which shows that after the second read the corresponding page cache was moved to the active LRU.

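3.3 Page cache in containers

Tracing the cAdvisor source shows that container_memory_cache comes from the cache field of memory.stat in the memcg. Tracing the Linux source further shows that cache takes its value from the MEMCG_CACHE counter in the memcg. Note that MEMCG_CACHE covers not only the Active File and Inactive File page cache described above but also the shared memory discussed in 2.1.

Modify the program from 3.2 slightly so that it stays resident instead of exiting, build it into a container image, deploy it on the TKEx platform, and look at the memory monitoring data.

Page cache takes up close to 100 MiB, while rss usage is very small. It is important to recognize that page cache also counts toward a container's memory usage.

Developers are rarely aware of how much page cache their own programs use, and the container platform limits a program's memory usage. So should you worry that growing page cache will push the program over the container memory limit and get it OOM-killed?

Let's explore this with an experiment. In a container with a 1 GiB memory limit, 0.8 GiB of rss has already been used through malloc/memset. Reading a 100 MiB file from disk then produces about 100 MiB of page cache. Container memory usage is now about 0.9 GiB, 100 MiB short of the 1 GiB limit.

Can the program still malloc/memset 150 MiB at this point? Will it be killed for exceeding the memcg limit?

Write the following program, build it into a container image, deploy it on the TKEx platform, and set the container memory limit to 1 GiB.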

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define ONE_GIB (1024 * 1024 * 1024)
#define EIGHT_TENTHS_GIB (0.8 * ONE_GIB)
#define ONE_HUNDRED_MIB (100 * 1024 * 1024)
#define ONE_FIFTY_MIB (150 * 1024 * 1024)
#define FILE_PATH "/root/test.txt"

// Allocate and touch `size` bytes; the memory is intentionally never freed
void allocate_memory(size_t size) {
    char *buffer = (char *)malloc(size);
    if (buffer == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    memset(buffer, 0, size);
}

void create_file(const char *filename, size_t size) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char *buffer = (char *)malloc(size);
    if (buffer == NULL) {
        perror("malloc");
        close(fd);
        exit(EXIT_FAILURE);
    }

    // Fill the buffer with random data
    for (size_t i = 0; i < size; i++) {
        buffer[i] = rand() % 256;
    }

    if (write(fd, buffer, size) != (ssize_t)size) {
        perror("write");
        free(buffer);
        close(fd);
        exit(EXIT_FAILURE);
    }

    free(buffer);
    close(fd);
}

void read_file(const char *filename, size_t size) {
    FILE *file = fopen(filename, "r");
    if (file == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    char *buffer = (char *)malloc(size);
    if (buffer == NULL) {
        perror("malloc");
        fclose(file);
        exit(EXIT_FAILURE);
    }

    // Only the side effect matters here: reading populates the page cache
    fread(buffer, 1, size, file);
    fclose(file);
    free(buffer);
}

int main() {
    printf("Allocating 0.8 GiB of RSS memory...\n");
    allocate_memory(EIGHT_TENTHS_GIB);
    printf("Waiting for 3 minutes...\n");
    sleep(180);

    printf("Creating a 100 MiB file with random data...\n");
    create_file(FILE_PATH, ONE_HUNDRED_MIB);
    printf("Waiting for 3 minutes...\n");
    sleep(180);

    printf("Reading 100 MiB from the file to generate pagecache...\n");
    read_file(FILE_PATH, ONE_HUNDRED_MIB);

    printf("Waiting for 3 minutes...\n");
    sleep(180);

    printf("Trying to allocate 150 MiB of memory...\n");
    allocate_memory(ONE_FIFTY_MIB);
    printf("Successfully allocated 150 MiB of memory.\n");

    sleep(3600);
    return 0;
}

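The run shows that the final 150 MiB can be allocated and used; the program is not killed.

Container memory monitoring before the 150 MiB allocation:

Container memory monitoring after the 150 MiB allocation:

rss did grow by 150 MiB, page cache shrank by 45 MiB, and total memory reached 1023 MiB without exceeding the 1 GiB limit. The reason: when memset triggers a page fault to allocate a physical page and the kernel finds that usage would exceed the memcg limit, it first tries to reclaim page cache to satisfy the allocation, starting with the Inactive File pages described earlier. So as long as the process's rss stays within the memcg limit, it can allocate memory safely; the kernel releases page cache in time to make room. Page cache belongs to the kernel, not the user. When the user needs memory, the kernel gives it back by reclaiming page cache, but this can come at a cost.

What is the cost?

Page cache exists to speed up disk file reads and writes. Reclaiming it means lower I/O performance and higher latency for the program, which is why dropping caches is generally forbidden in production.
The page fault takes a more complex path; page allocation slows down and blocks the user process directly, degrading application performance.

Containers that read and write files frequently often sit at close to 99% memory utilization. This is because Linux, to speed up file I/O, allocates as much page cache as it can within the memcg limit.

Summary 4

The cache figure of a container covers both the page cache produced by file reads and writes and the shared memory in use.

In a container, when memory usage approaches the memcg limit, further allocations first trigger page cache reclaim to satisfy the request.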

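4 container_memory_mapped_file

4.1 mmap file mapping

mmap can do more than allocate anonymous pages for a program. It is also a way to memory-map files: it maps the content of a file or device into the process's address space. Through mmap you can read and even modify file content as if it were memory, which is often more efficient than traditional file I/O. Consider the following program: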

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define FILE_PATH "/root/test.txt"
#define FILE_SIZE (100 * 1024 * 1024) // 100 MiB

void create_file(const char *filename, size_t size) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char *buffer = (char *)malloc(size);
    if (buffer == NULL) {
        perror("malloc");
        close(fd);
        exit(EXIT_FAILURE);
    }

    // Fill the buffer with 'A'
    memset(buffer, 'A', size);

    if (write(fd, buffer, size) != (ssize_t)size) {
        perror("write");
        free(buffer);
        close(fd);
        exit(EXIT_FAILURE);
    }

    free(buffer);
    close(fd);
}

int main() {
    // Step 1: Create a 100 MiB file with 'A'
    printf("Creating a 100 MiB file with 'A'...\n");
    create_file(FILE_PATH, FILE_SIZE);

    // Step 2: Open the file for reading and writing
    int fd = open(FILE_PATH, O_RDWR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    // Step 3: Memory-map the file
    char *mapped = (char *)mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapped == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(EXIT_FAILURE);
    }

    // Step 4: Modify the file content through the memory-mapped region
    printf("Modifying the file content to 'B'...\n");
    memset(mapped, 'B', FILE_SIZE);
    printf("File content successfully modified to 'B'.\n");

    sleep(240);

    // Step 5: Clean up
    if (munmap(mapped, FILE_SIZE) == -1) {
        perror("munmap");
    }
    close(fd);

    return 0;
}

The program first creates a 100 MiB text file filled with the letter A, then maps the file into its address space with mmap and uses memset to change the whole content to the letter B. With an mmap file mapping, file reads and writes are done with plain memory operations. Compared with standard buffered I/O, an mmap file mapping can perform better because it avoids copying between user space and kernel space; the advantage is most visible when tens or hundreds of MiB are read or written at once.

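Build this program into a container image, deploy it on the TKEx platform, and look at the memory monitoring data.

The mmap part, container_memory_mapped_file, is close to 100 MiB, while the container's rss is still very low. Now look at /proc/<pid>/status: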

...  
VmRSS:    103932 kB  
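...

The process's RSS, however, is about 101 MiB. So, just like the shared memory discussed earlier, the mmap file-mapped part belongs to the process's RSS but not to the container's rss.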

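4.2 Shared memory with mmap

4.2.1 Shared file mapping

Building on 4.1: as long as several processes mmap the same file, they can use that file as shared memory for inter-process communication. This approach is called shared file mapping.

When mmap is called to map a file, the kernel first creates a new virtual memory area (VMA) in the process's virtual address space for the mapping, and uses vm_area_struct->vm_file to associate the struct file of the mapped file with the virtual memory mapping.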

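struct vm_area_struct {
    /* ... other fields omitted ... */
    struct file * vm_file;      /* File we map to (can be NULL). */
    unsigned long vm_pgoff;     /* Offset (within vm_file) in PAGE_SIZE */
};

In the page fault handler, if the vma is not anonymous (that is, it is a file mapping), Linux first uses vm_area_struct->vm_pgoff to locate the corresponding page cache and reads ahead part of the file from disk into it, then creates a PTE in the page table and points it at the page cache page, which completes the page fault. From then on, accesses to the vma are in fact accesses to the page cache. When process 1 and process 2 share a file mapping, the file fields of their vmas ultimately point to the same file, that is, the same inode. Their accesses to their own vmas are therefore accesses to the same page cache, and this is how shared memory based on file mapping works. Of course, modifying the vma content also modifies the page cache, and dirty page writeback eventually updates the file on disk, so this kind of shared memory produces real disk I/O.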

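4.2.2 Shared anonymous mapping

Shared anonymous mappings can also implement shared memory, but only between a parent process and its children. The mechanism is similar to shared file mapping and also relies on the page cache, except that the file is no longer a concrete file on disk but one in tmpfs. tmpfs is a file system implemented in memory, so shared memory based on tmpfs produces no real disk I/O. As we will see later, IPC-based shared memory, that is, the shared memory created with shmget and shmat in 2.1, is also built on tmpfs.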

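4.3 Mapped files in containers

Back in the cAdvisor source, container_memory_mapped_file takes its value from the mapped_file field of the memcg's memory.stat, which is the NR_FILE_MAPPED counter in the memcg. File pages mapped through mmap are all counted in container_memory_mapped_file. As described in 4.2.1, mmap file mapping is closely tied to the page cache, so mapped_file shows up together with page cache.

In addition, mapped_file also includes tmpfs usage. tmpfs and shmem are introduced next.

5 tmpfs and shmem

5.1 The emptyDir problem

emptyDir lets users choose memory as the backing medium.

When you do, the mount point (/data in the output below) is backed by tmpfs, which means the data under /data is actually stored in memory.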

# df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  49.1G      2.7G     46.4G   5% /
tmpfs                     8.0G         0      8.0G   0% /data

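If no sizeLimit is set for the emptyDir volume, files under /data count against the Pod's memory; if the Pod has no memory limit either, /data may consume all the memory on the Node.

In day-to-day troubleshooting we often get tickets from puzzled customers: the process does not seem to leak memory, yet memory usage keeps climbing. The dashboard shows page cache rising steadily, and the cause turns out to be a program writing logs to the /data/ directory mounted on tmpfs. So do not mount a memory-backed emptyDir and then use it as the log output directory.

5.2 System V IPC shared memory

IPC shared memory is used widely inside the company, for example by the spp server framework. Consider the following C programs: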

5.2.1 Writer

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>
#include <unistd.h>

#define SHM_SIZE (36 * 1024 * 1024)  // 36 MiB

int main() {
    // Generate a key. Note that ftok() returns -1 if "shmfile" does not exist.
    key_t key = ftok("shmfile", 65);
    int shmid = shmget(key, SHM_SIZE, 0666 | IPC_CREAT);  // Create the shared memory segment

    if (shmid == -1) {
        perror("shmget failed");
        exit(1);
    }

    char *data = (char *)shmat(shmid, (void *)0, 0);  // Attach to the shared memory segment

    if (data == (char *)(-1)) {
        perror("shmat failed");
        exit(1);
    }

    // Write data into the shared memory
    strcpy(data, "Hello, this is a message from the writer process!");

    printf("Data written to shared memory: %s\n", data);

    sleep(3600);

    // Detach
    if (shmdt(data) == -1) {
        perror("shmdt failed");
        exit(1);
    }

    return 0;
}

5.2.2 Reader

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>
#include <unistd.h>

#define SHM_SIZE (36 * 1024 * 1024)  // 36 MiB

int main() {
    // Generate the same key as the writer. ftok() returns -1 if "shmfile" does not exist.
    key_t key = ftok("shmfile", 65);
    int shmid = shmget(key, SHM_SIZE, 0666);  // Look up the shared memory segment

    if (shmid == -1) {
        perror("shmget failed");
        exit(1);
    }

    char *data = (char *)shmat(shmid, (void *)0, 0);  // Attach to the shared memory segment

    if (data == (char *)(-1)) {
        perror("shmat failed");
        exit(1);
    }

    // Read the data in the shared memory
    printf("Data read from shared memory: %s\n", data);

    sleep(3600);

    // Detach
    if (shmdt(data) == -1) {
        perror("shmdt failed");
        exit(1);
    }

    // Remove the shared memory segment
    if (shmctl(shmid, IPC_RMID, NULL) == -1) {
        perror("shmctl failed");
        exit(1);
    }

    return 0;
}

Compile and run 5.2.1 and 5.2.2 separately. 5.2.2 reads the message "Hello, this is a message from the writer process!" written by 5.2.1.

Running ipcs -m at the same time shows the 36 MiB of shared memory we allocated.

# ipcs -m  
------ Shared Memory Segments --------  
key        shmid      owner      perms      bytes      nattch     status        
...  
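0xffffffff 7          root       666        37748736   2

Note that if the Writer and Reader processes exit without removing the segment (for example because they are killed during the sleep), this memory stays in the machine's tmpfs and must be explicitly removed with the ipcrm command.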

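Back in the container world: after a container exits, the data its processes left in shared memory does not disappear either. If none of the remaining containers uses that shared memory, its usage is only charged to the Pod-level memcg.

If you find that a Pod's memory usage is clearly larger than the sum of its containers' memory usage, use ipcs to check whether shmem data is present.

6 Monitoring in practice

6.1 A tip for self-monitoring memory usage in a program

Linux provides the getrusage(2) system call for retrieving the resource usage of a process and its children. We already met it in 1.1; here is a Go example as well.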

package main

import (
        "fmt"
        "syscall"
        "time"
)

// printUsage calls getrusage(2) for the current process and prints every field.
func printUsage() error {
        var usage syscall.Rusage
        if err := syscall.Getrusage(syscall.RUSAGE_SELF, &usage); err != nil {
                return err
        }

        fmt.Printf("User CPU time used: %+v \n", usage.Utime)
        fmt.Printf("System CPU time used: %+v \n", usage.Stime)
        fmt.Printf("Maximum resident set size: %v \n", usage.Maxrss)
        fmt.Printf("Integral shared memory size: %v \n", usage.Ixrss)
        fmt.Printf("Integral unshared data size: %v \n", usage.Idrss)
        fmt.Printf("Integral unshared stack size: %v \n", usage.Isrss)
        fmt.Printf("Page reclaims (soft page faults): %v\n", usage.Minflt)
        fmt.Printf("Page faults (hard page faults): %v\n", usage.Majflt)
        fmt.Printf("Swaps: %v\n", usage.Nswap)
        fmt.Printf("Block input operations: %v\n", usage.Inblock)
        fmt.Printf("Block output operations: %v\n", usage.Oublock)
        fmt.Printf("IPC messages sent: %v\n", usage.Msgsnd)
        fmt.Printf("IPC messages received: %v\n", usage.Msgrcv)
        fmt.Printf("Signals received: %v\n", usage.Nsignals)
        fmt.Printf("Voluntary context switches: %v\n", usage.Nvcsw)
        fmt.Printf("Involuntary context switches: %v\n", usage.Nivcsw)
        return nil
}

func main() {
        // First getrusage call: print the resource usage at startup
        if err := printUsage(); err != nil {
                fmt.Printf("Error getting resource usage: %v\n", err)
                return
        }

        // Simulate some CPU load
        for i := 0; i < 1e8; i++ {
                _ = i * i
        }

        time.Sleep(2 * time.Second)

        // Second getrusage call: print the resource usage again
        fmt.Printf("\nAfter sleep:\n")
        if err := printUsage(); err != nil {
                fmt.Printf("Error getting resource usage: %v\n", err)
                return
        }
}

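As you can see, getrusage(2) also lets developers self-monitor CPU usage.

6.2 top and the PID namespace

The CPU and memory utilization that top shows inside a container is usually not the container's real utilization, because /proc/stat and /proc/meminfo describe the whole machine rather than the Pod or the container.

If your container runs on a TKE Serverless node, TKEx and TKE AppFabric also provide basic monitoring of the VM that hosts the Pod.

The VM's monitoring data matches the output of top.

If you use top inside a container to watch process data, keep in mind that containers in a Pod do not share a PID namespace by default, so you cannot see the processes of another container.

Enabling PID namespace sharing opens up more ways to observe, such as debugging the main container's processes from a sidecar container that carries dlv, gdb, or similar tools. It does require the corresponding privileges, such as ptrace, and it rules out deploying the workload as a rich container started by Systemd.

6.3 My container's memory utilization is over 100%

It looks like I am getting free memory from the platform. What is going on?

Here memory usage is far above the container's own memory limit, and common sense says the container should be OOM-killed. Yet in production there are containers that clearly exceed their memory limit and keep running normally.

K8s sets memcg memory limits at both the Pod level and the Container level. If any container's memory usage breaks through the Container-level limit, an OOM kill is triggered; if the combined usage of all containers breaks through the Pod-level limit, an OOM kill is triggered as well. Running over the limit means both of these limits have stopped working.

Investigation shows that Pods running over their limit generally share two traits:

There is a rich container started by Systemd, and the Systemd version is earlier than 236;
There is a sidecar container with no Limit configured.

When both hold, the two layers of limits that K8s sets both stop working. If the container is privileged and /sys/fs/cgroup is mounted, Systemd overrides the cgroup limit that K8s set for the container. Any container without a Limit downgrades the Pod's QoS to Burstable or even BestEffort, and the Pod-level memory limit becomes unlimited.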

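Running over the limit eats into the Node's memory and creates stability risks. We recommend using a recent ubuntu/centos/tlinux base image with a recent Systemd version to start workload containers, so that limits are not bypassed.

6.4 I am worried about OOM kills. Which metric should I alert on?

Memory alerts are usually based on container_memory_working_set_bytes. Memory utilization in percent is computed as: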

100 * container_memory_working_set_bytes{container="$container", pod="$pod", namespace="$namespace"}
/ kube_pod_container_resource_limits{resource="memory", container="$container", pod="$pod", namespace="$namespace"}

container_memory_working_set_bytes takes the memcg's total usage and subtracts the Inactive File part, on the view that this page cache can be reclaimed quickly without putting noticeable pressure on the workload and so need not count toward the container's memory usage. The relevant cAdvisor code is shown below.

workingSet := ret.Memory.Usage  
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {  
    ret.Memory.TotalInactiveFile = v  
    if workingSet < v {  
       workingSet = 0  
    } else {  
       workingSet -= v  
    }  
}  
ret.Memory.WorkingSet = workingSet

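7 Closing

Along the way we covered page faults and how RSS is counted, met the Linux memory cgroup (memcg), watched page cache being allocated and reclaimed, got a first look at tmpfs, and saw how shared memory behaves in containers. By now the questions raised at the beginning of the article should have clear answers. May your programs stay rock solid and never OOM.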

END

Author: frostchen
Source: 腾讯技术工程 (Tencent Technology Engineering)
