
Troubleshooting a Memory Leak

2024-09-25 09:31:57

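Symptoms

The server became noticeably sluggish, with high CPU and memory usage. The problem disappeared after the xx container was restarted.

Investigation

Step 1: Check the container's memory usage

Log the memory usage every 30 seconds. Over 24 hours, memory usage grew by 0.17 GB.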

nohup sudo sh stats.sh 2>&1 &

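#!/bin/bash
# stats.sh: append the CPU and memory usage of container 5dc2b9d32f19 to logs.txt every 30 seconds

while true; do
    # grep keeps only the row for this container and drops the table header
    docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" | grep 5dc2b9d32f19 >> logs.txt
    sleep 30
done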
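
To turn logs.txt into a single growth figure like the 0.17 GB above, the logged lines can be parsed with a few lines of Python. This is only a sketch: it assumes each line contains a MemUsage column in the `used / limit` form with binary units (for example `1.17GiB / 15.5GiB`), and the function names are made up for illustration.

```python
import re

# Multipliers for the binary units assumed to appear in the MemUsage column.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}
# Matches the "used" half of "used / limit", e.g. "1.17GiB /".
_MEM_RE = re.compile(r"([\d.]+)\s*(B|KiB|MiB|GiB|TiB)\s*/")


def parse_mem_bytes(line):
    """Return the used memory of one logged line in bytes, or None if absent."""
    match = _MEM_RE.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def memory_growth_gib(lines):
    """Difference between the last and the first parsed sample, in GiB."""
    samples = [b for b in map(parse_mem_bytes, lines) if b is not None]
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / _UNITS["GiB"]


sample_lines = [
    "5dc2b9d32f19   0.50%   1.00GiB / 15.5GiB",
    "5dc2b9d32f19   0.70%   1.17GiB / 15.5GiB",
]
print(f"growth: {memory_growth_gib(sample_lines):+.2f} GiB")
```

For the real log, pass the open file instead of `sample_lines`, for example `memory_growth_gib(open("logs.txt", encoding="utf-8"))`.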

Step 2: Check per-process memory usage

Find out which processes' memory usage keeps growing.

#!/bin/bash

# Output file
output_file="mem_usage_log.txt"

# Loop forever
while true; do
    # Temporary file holding the unsorted per-process memory usage
    temp_file=$(mktemp)

    # Collect the resident memory (VmRSS, in KB) of every process
    for pid in $(ls /proc | grep -E '^[0-9]+$'); do
        if [ -f "/proc/$pid/status" ]; then
            mem=$(grep VmRSS "/proc/$pid/status" 2>/dev/null | awk '{print $2}')
            # Kernel threads have no VmRSS line; skip them
            [ -z "$mem" ] && continue
            name=$(ps -p "$pid" -o comm=)
            # Tab-separated so that process names containing spaces stay intact
            printf '%s\t%s\t%s\n' "$pid" "$name" "$mem" >> "$temp_file"
        fi
    done

    # Sort by PID and append the rows to the log
    echo "Timestamp: $(date)" >> "$output_file"
    echo -e "PID\tProcess\tMemory(KB)" >> "$output_file"
    sort -n "$temp_file" >> "$output_file"
    echo -e "\n" >> "$output_file"

    # Remove the temporary file
    rm "$temp_file"

    # Wait 30 seconds
    sleep 30
done

Compare how each process's memory usage in mem_usage_log.txt increased or decreased between two points in time.

PID | Process | Memory (KB) - 2024-06-28 | Memory (KB) - 2024-07-01 | Difference (KB)
--- | --- | --- | --- | ---
217 | gunicorn | 241356 | 246616 | +5260
229 | gunicorn | 156692 | 160572 | +3880
29 | python launch_ecs_worker.py ecs_default 2 | 96444 | 99868 | +3424
219 | gunicorn | 115040 | 118324 | +3284
27 | python launch_ecs_worker.py ecs_default 2 True | 110640 | 113944 | +3304
33 | python launch_ecs_worker.py ecs_default 2 | 98080 | 101852 | +2772
21 | python launch_ecs_worker.py ecs_default 2 | 100480 | 102200 | +1720
22 | python launch_ecs_worker.py ecs_default 2 | 99864 | 101480 | +1616
10 | python launch_ecs_worker.py ecs_poller_queue 2 True | 96372 | 97168 | +796
15 | python launch_ecs_worker.py ecs_default | 98112 | 99172 | +1060
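
A comparison like this does not have to be done by hand. The sketch below assumes the log layout written by the script in this step (a `Timestamp:` line, a header line, then tab-separated `PID`, name and memory rows); `parse_snapshots` and `rss_growth` are hypothetical helper names, not part of any library.

```python
def parse_snapshots(text):
    """Split the log into one {pid: (name, rss_kb)} dict per timestamp."""
    snapshots = []
    for line in text.splitlines():
        if line.startswith("Timestamp:"):
            snapshots.append({})
            continue
        fields = line.split("\t")
        # Data rows look like "<pid>\t<name>\t<rss_kb>"; headers, blank lines
        # and rows without a memory value are skipped.
        if snapshots and len(fields) == 3 and fields[0].isdigit() and fields[2].isdigit():
            snapshots[-1][int(fields[0])] = (fields[1], int(fields[2]))
    return snapshots


def rss_growth(first, last):
    """Rows (pid, name, before, after, diff) for PIDs in both snapshots, largest growth first."""
    rows = []
    for pid, (name, after) in last.items():
        if pid in first:
            before = first[pid][1]
            rows.append((pid, name, before, after, after - before))
    return sorted(rows, key=lambda row: row[4], reverse=True)


sample_log = (
    "Timestamp: first\nPID\tProcess\tMemory(KB)\n"
    "217\tgunicorn\t241356\n229\tgunicorn\t156692\n\n\n"
    "Timestamp: second\nPID\tProcess\tMemory(KB)\n"
    "217\tgunicorn\t246616\n229\tgunicorn\t160572\n"
)
snapshots = parse_snapshots(sample_log)
for row in rss_growth(snapshots[0], snapshots[-1]):
    print(*row, sep="\t")
```

Processes that exist in only one of the two snapshots are left out, since a restarted worker gets a new PID and its growth cannot be compared directly.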

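Step 3: Check which objects' memory usage keeps growing

1) Use the tracemalloc library to export memory snapshots and compare them

tracemalloc is a tool for debugging the memory blocks allocated by Python. It provides the following information:

  1. Where an object's memory was allocated.
  2. Per-file and per-line statistics on Python's allocated memory blocks: total size, number of blocks, and average block size.
  3. The difference between two memory snapshots, which helps track down memory leaks.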
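
Before wiring tracemalloc into a service, the whole cycle (start tracing, take a snapshot, let memory grow, take another snapshot, compare) can be tried in a standalone script. This is a minimal sketch; `leaky_append` and `retained` are made-up stand-ins for real leaking code.

```python
import tracemalloc


def leaky_append(store, count):
    """Stand-in for leaking code: parks 1 KiB buffers in a long-lived list."""
    for _ in range(count):
        store.append(bytearray(1024))


tracemalloc.start()                       # start tracing Python allocations
before = tracemalloc.take_snapshot()

retained = []                             # plays the role of a global cache
leaky_append(retained, 1000)              # about 1 MiB that is never released

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Per-line differences between the two snapshots, largest change first
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:3]:
    print(stat)
```

The first line printed should point at the `store.append(...)` line, which is the kind of hint the comparison gives for a real leak.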

import traceback
import tracemalloc
from datetime import datetime

class SaveMemView(BaseViewECS):
    async def get(self):
        try:
            # Tracing must already be running (tracemalloc.start()),
            # otherwise take_snapshot() raises an error
            snapshot = tracemalloc.take_snapshot()
            # Save the snapshot to a file named with a millisecond timestamp
            now = datetime.now()
            timestamp_str = now.strftime('%Y%m%d%H%M%S%f')[:-3]
            snapshot_file = f'tracemalloc-{timestamp_str}'
            snapshot.dump(snapshot_file)
            # Also write the largest gc-tracked objects (see the gc section)
            self.show_memory(timestamp_str)
            return self.common_success_response('')
        except Exception:
            return self.common_error_response(biz_error_msg=traceback.format_exc())

Call the endpoints to export memory snapshots

curl 'http://127.0.0.1:12080/api/mem_check/start/'
curl 'http://127.0.0.1:12080/api/mem_check/save/'
curl 'http://127.0.0.1:12080/api/mem_check/stop/'

Compare the snapshots

import tracemalloc
snapshot_file1 = 'tracemalloc-20240512180542633'
loaded_snapshot1 = tracemalloc.Snapshot.load(snapshot_file1)

snapshot_file2 = 'tracemalloc-20240513010654228'
loaded_snapshot2 = tracemalloc.Snapshot.load(snapshot_file2)

# Compare the snapshots and print the 10 lines whose allocations changed the most
top_stats = loaded_snapshot2.compare_to(loaded_snapshot1, 'lineno')

for stat in top_stats[:10]:
    print(stat)

Comparison result

python3 mem_compare.py

/usr/local/lib/python3.6/threading.py:846: size=185 KiB (+185 KiB), count=164 (+164), average=1156 B

/usr/local/lib/python3.6/urllib/request.py:1377: size=174 KiB (+151 KiB), count=267 (+229), average=667 B

/usr/local/lib/python3.6/http/client.py:956: size=94.9 KiB (+72.3 KiB), count=150 (+112), average=648 B

/usr/local/lib/python3.6/threading.py:884: size=64.8 KiB (+64.8 KiB), count=79 (+79), average=840 B

/usr/local/lib/python3.6/_sitebuiltins.py:57: size=25.6 KiB (+25.6 KiB), count=222 (+222), average=118 B

/usr/local/lib/python3.6/_weakrefset.py:84: size=120 KiB (+25.4 KiB), count=1095 (+295), average=112 B

/usr/local/lib/python3.6/http/client.py:1286: size=38.9 KiB (+24.2 KiB), count=69 (+41), average=577 B

/usr/local/lib/python3.6/_weakrefset.py:37: size=24.8 KiB (+23.7 KiB), count=192 (+184), average=132 B

/var/www/yacos/common/sdk/tracing/aiohttp/middleware.py:31: size=22.7 KiB (+21.7 KiB), count=28 (+26), average=829 B

/usr/local/lib/python3.6/_weakrefset.py:48: size=21.0 KiB (+20.1 KiB), count=96 (+92), average=224 B
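
Most of the top entries above are standard-library lines (threading.py, urllib/request.py), which show what was allocated but not which application code triggered the allocation. One possible next step, sketched here, is to record several frames per allocation with `tracemalloc.start(nframes)` and group the comparison by `'traceback'` instead of `'lineno'`. The helper `top_tracebacks` and the functions `fetch` and `handle_request` are hypothetical and only illustrate the idea.

```python
import tracemalloc


def top_tracebacks(new_snapshot, old_snapshot, limit=3):
    """Return formatted call stacks for the allocations that grew the most."""
    stats = new_snapshot.compare_to(old_snapshot, "traceback")
    report = []
    for stat in stats[:limit]:
        header = f"{stat.size_diff / 1024:+.1f} KiB, {stat.count_diff:+d} blocks"
        report.append("\n".join([header, *stat.traceback.format()]))
    return report


def fetch(cache):
    # Hypothetical leaf function that allocates and keeps about 1 KiB per call.
    cache.append(bytearray(1024))


def handle_request(cache):
    # Hypothetical caller; with "lineno" grouping only the line in fetch() shows up.
    fetch(cache)


tracemalloc.start(10)          # keep up to 10 frames per allocation
old = tracemalloc.take_snapshot()
cache = []
for _ in range(500):
    handle_request(cache)
new = tracemalloc.take_snapshot()
tracemalloc.stop()

report = top_tracebacks(new, old, limit=1)
print(report[0])
```

Keeping more frames makes tracing slower and the snapshots larger, so the frame count is a trade-off rather than a fixed recommendation.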

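2) Use py-spy

py-spy is a sampling profiler for Python programs. It samples a program while it is running, with no changes to the source code and no restart of the process, and helps developers understand and optimize the program's performance.

py-spy Git repository: https://github.com/benfred/py-spy

Usage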

py-spy record -o profile.svg --pid 199046

py-spy top --pid 199005

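3) Use the gc library

gc (garbage collector) is a module in the Python standard library that exposes the interface to the garbage collector. With it you can turn collection on and off, tune how often collection runs, and print debugging information.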
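
The calls mentioned above can be tried in isolation. The sketch below only uses functions from the standard gc module; the `Node` class and the threshold values are illustrative assumptions, not recommended settings.

```python
import gc

print("enabled:", gc.isenabled())           # is automatic collection switched on?
print("thresholds:", gc.get_threshold())    # allocation thresholds per generation
print("counts:", gc.get_count())            # current counters per generation

# Tune the collection frequency, then restore the previous setting.
old_threshold = gc.get_threshold()
gc.set_threshold(1000, 15, 15)              # illustrative values only
gc.set_threshold(*old_threshold)


class Node:
    """Tiny object used to build a reference cycle."""

    def __init__(self):
        self.other = None


gc.collect()                                # clear any garbage left from earlier
a, b = Node(), Node()
a.other, b.other = b, a                     # cycle: reference counts never reach zero
del a, b

gc.set_debug(gc.DEBUG_STATS)                # print collection statistics to stderr
unreachable = gc.collect()                  # force a full collection
gc.set_debug(0)                             # switch debugging output off again
print("unreachable objects found:", unreachable)
```

`gc.disable()` and `gc.enable()` switch automatic collection off and on in the same way.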


Output the objects that use the most memory
import gc
import sys

def show_memory(self, timestamp_str):
    # Collect every object tracked by the garbage collector with its shallow size
    # (sys.getsizeof does not count the objects a container refers to)
    object_list = []
    for obj in gc.get_objects():
        size = sys.getsizeof(obj)
        object_list.append((obj,size))
    # Largest objects first
    sorted_values = sorted(object_list,
                           key=lambda x: x[1],
                           reverse=True)

    # Write the 10 largest objects; LEN is the total number of tracked objects
    file_path = f"gc-{timestamp_str}.txt"
    with open(file_path, 'w', encoding='utf-8') as f:
        for obj, size in sorted_values[:10]:
            f.write(f"OBJ: {id(obj)}, "
                    f"TYPE: {type(obj)}, "
                    f"SIZE: {size / 1024 / 1024:.2f} MB, "
                    f"REPR: {str(obj)[:100]}, "
                    f"LEN: {len(sorted_values)}\n")

Corresponding output

cat gc-20240701021542039.txt

OBJ: 139989418379208, TYPE: <class 'list'>, SIZE: 0.13 MB, REPR: [(0, 64, (('/usr/local/lib/python3.6/typing.py', 1231),)), (0, 64, (('/usr/local/lib/python3.6/typin, LEN: 198677

OBJ: 139991870000296, TYPE: <class 'dict'>, SIZE: 0.07 MB, REPR: {139991878254560: <weakref at 0x7f5265aee638; to 'type' at 0x7f52662cbfe0 (type)>, 139991878271072: , LEN: 198677

OBJ: 139991870197696, TYPE: <class 'dict'>, SIZE: 0.07 MB, REPR: {'builtins': <module 'builtins' (built-in)>, 'sys': <module 'sys' (built-in)>, '_frozen_importlib': , LEN: 198677

OBJ: 139991868766408, TYPE: <class 'list'>, SIZE: 0.05 MB, REPR: ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018 Paul T. McGuire\n', '#\n', '# Permiss, LEN: 198677

OBJ: 139991848963168, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'application/octet-stream': ['.a', '.bin', '.dll', '.exe', '.o', '.obj', '.so', '.deploy', '.msu', , LEN: 198677

OBJ: 139991829458088, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'__name__': 'microservices.ecs.route', '__doc__': None, '__package__': 'microservices.ecs', '__load, LEN: 198677

OBJ: 139991730020424, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'CRYPTOGRAPHY_PACKAGE_VERSION': <cdata 'char *' 0x7f525d42c000>, 'Cryptography_HAS_EC2M': 1, 'Crypt, LEN: 198677

OBJ: 139991729969912, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'__name__': 'lib', '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, '_or, LEN: 198677

OBJ: 139991783838416, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'/debug/log/{mark:\\d+}/': <class 'microservices.ecs.api.debug.DebugLogQuery'>, '/logws': <class 'm, LEN: 198677

OBJ: 139991845320744, TYPE: <class 'set'>, SIZE: 0.03 MB, REPR: {<weakref at 0x7f51d3b00b88; to 'ABCMeta' at 0x558e29988058 (VmOsCreateView)>, <weakref at 0x7f51d3a, LEN: 198677
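
Because `sys.getsizeof` reports only an object's own (shallow) size, the largest entries above are all well under 1 MB and say little about what is accumulating. A complementary approach, sketched here as an assumption rather than something the original investigation did, is to count gc-tracked objects by type at two points in time and look at which types grew. `count_by_type`, `type_growth` and the `Session` class are hypothetical names.

```python
import gc
from collections import Counter


def count_by_type():
    """Count gc-tracked objects per type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def type_growth(before, after, limit=10):
    """Return the types whose instance count grew the most."""
    growth = after - before     # Counter subtraction keeps positive counts only
    return growth.most_common(limit)


class Session:
    """Hypothetical application object that is never released."""


before = count_by_type()
leaked = [Session() for _ in range(500)]    # simulate objects piling up
after = count_by_type()

for type_name, increase in type_growth(before, after, limit=5):
    print(f"{type_name}: +{increase}")
```

In a service, the two counts would be taken some hours apart, for example from the same endpoint that dumps the tracemalloc snapshot.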
