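Symptoms
The server became noticeably sluggish, with high CPU and memory usage. The problem disappeared after the xx container was restarted.
Troubleshooting procedure
step1 Check container memory consumption
Log the container's memory usage every 30 seconds. Over 24 hours, memory usage grew by 0.17 GB.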
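Run the script in the background:
nohup sudo sh stats.sh 2>&1 &
stats.sh:
#!/bin/bash
# Append one docker stats sample for the target container to logs.txt every 30 seconds
while true; do
    docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" | grep 5dc2b9d32f19 >> logs.txt
    sleep 30
done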
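To turn logs.txt into a growth figure like the 0.17 GB quoted above, the first and last samples can be compared. The following is a minimal sketch, assuming each logged line contains a MemUsage column of the form `1.00GiB / 15.5GiB`. The sample lines and the names `used_bytes` and `memory_growth` are hypothetical.

```python
import re

# Binary unit multipliers as printed in the MemUsage column (e.g. "1.2GiB").
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
# Matches the "used" half of a MemUsage value such as "1.00GiB / 15.5GiB".
MEM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)\s*/")


def used_bytes(line):
    """Return the used memory in bytes from one logged line, or None."""
    match = MEM_RE.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNITS[unit]


def memory_growth(lines):
    """Difference in bytes between the last and the first parsable sample."""
    samples = [b for b in map(used_bytes, lines) if b is not None]
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[0]


# Hypothetical samples in the assumed layout (container, CPU %, MemUsage).
sample_log = [
    "5dc2b9d32f19   12.50%   1.00GiB / 15.5GiB",
    "5dc2b9d32f19   11.80%   1.25GiB / 15.5GiB",
]
print(f"growth: {memory_growth(sample_log) / 1024**3:.2f} GiB")
```

For the real log, pass `open("logs.txt")` to `memory_growth` instead of the sample list.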
step2 Check per-process memory consumption
Find out which processes' memory usage is growing.
#!/bin/bash
# Output file
output_file="mem_usage_log.txt"
# Loop forever
while true; do
    # Temporary file holding the unsorted per-process memory usage
    temp_file=$(mktemp)
    # Collect the resident memory (VmRSS) of every process
    for pid in $(ls /proc | grep -E '^[0-9]+$'); do
        if [ -f "/proc/$pid/status" ]; then
            mem=$(grep VmRSS "/proc/$pid/status" | awk '{print $2}')
            name=$(ps -p "$pid" -o comm=)
            # Kernel threads have no VmRSS line, so skip empty values.
            # The name goes last because it may contain spaces.
            [ -n "$mem" ] && echo "$pid $mem $name" >> "$temp_file"
        fi
    done
    # Sort by PID and format the output
    echo -e "Timestamp: $(date)" >> "$output_file"
    echo -e "PID\tProcess\tMemory(KB)" >> "$output_file"
    sort -n "$temp_file" | while read -r pid mem name; do
        echo -e "$pid\t$name\t$mem" >> "$output_file"
    done
    echo -e "\n" >> "$output_file"
    # Delete the temporary file
    rm "$temp_file"
    # Wait 30 seconds
    sleep 30
done
Compare how each process's memory consumption in mem_usage_log.txt increased or decreased between two dates.
| PID | Process | Memory (KB) - 2024-06-28 | Memory (KB) - 2024-07-01 | Difference (KB) |
| --- | --- | --- | --- | --- |
| 217 | gunicorn | 241356 | 246616 | +5260 |
| 229 | gunicorn | 156692 | 160572 | +3880 |
| 29 | python launch_ecs_worker.py ecs_default 2 | 96444 | 99868 | +3424 |
| 219 | gunicorn | 115040 | 118324 | +3284 |
| 27 | python launch_ecs_worker.py ecs_default 2 True | 110640 | 113944 | +3304 |
| 33 | python launch_ecs_worker.py ecs_default 2 | 98080 | 101852 | +3772 |
| 21 | python launch_ecs_worker.py ecs_default 2 | 100480 | 102200 | +1720 |
| 22 | python launch_ecs_worker.py ecs_default 2 | 99864 | 101480 | +1616 |
| 10 | python launch_ecs_worker.py ecs_poller_queue 2 True | 96372 | 97168 | +796 |
| 15 | python launch_ecs_worker.py ecs_default | 98112 | 99172 | +1060 |
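A comparison like the table above can be produced with a short script rather than by hand. The sketch below assumes the tab-separated layout written by the step2 script (a `Timestamp:` line followed by rows of PID, process name and RSS in KB). The function names and the timestamps are illustrative; the sample values are those of PIDs 217 and 229 from the table.

```python
def parse_snapshots(text):
    """Split the log into a list of (timestamp, {pid: (name, rss_kb)})."""
    snapshots = []
    for line in text.splitlines():
        if line.startswith("Timestamp: "):
            snapshots.append((line[len("Timestamp: "):], {}))
            continue
        fields = line.split("\t")
        # Data rows are "PID<TAB>name<TAB>KB"; header and blank lines are skipped.
        if (snapshots and len(fields) == 3
                and fields[0].isdigit() and fields[2].isdigit()):
            snapshots[-1][1][int(fields[0])] = (fields[1], int(fields[2]))
    return snapshots


def rss_growth(first, last):
    """Rows of (difference_kb, pid, name) for PIDs in both snapshots, largest first."""
    rows = []
    for pid, (name, after_kb) in last.items():
        # Require the same name as well, since a PID can be reused by another process.
        if pid in first and first[pid][0] == name:
            rows.append((after_kb - first[pid][1], pid, name))
    return sorted(rows, reverse=True)


sample = (
    "Timestamp: (first sample)\n"
    "PID\tProcess\tMemory(KB)\n"
    "217\tgunicorn\t241356\n"
    "229\tgunicorn\t156692\n"
    "\n"
    "Timestamp: (last sample)\n"
    "PID\tProcess\tMemory(KB)\n"
    "217\tgunicorn\t246616\n"
    "229\tgunicorn\t160572\n"
)
snapshots = parse_snapshots(sample)
for diff_kb, pid, name in rss_growth(snapshots[0][1], snapshots[-1][1]):
    print(f"{pid}\t{name}\t{diff_kb:+d}")
```

Comparing only the first and last snapshots hides short-lived spikes, so plotting a suspicious PID across all snapshots may still be worthwhile.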
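step3 Check which objects' memory usage is growing
1) Use the tracemalloc library to export memory snapshots and compare them
tracemalloc is a tool for debugging the memory blocks allocated by Python. It provides the following information:
- The location where an object was allocated.
- Statistics on Python's allocated memory blocks per file and per line: total size, number of blocks, and average block size.
- The differences between two memory snapshots, which helps track down memory leaks.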
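import traceback
import tracemalloc
from datetime import datetime

# BaseViewECS is the project's own base view class
class SaveMemView(BaseViewECS):
    async def get(self):
        try:
            # Raises RuntimeError unless tracemalloc is already tracing (started via the /start/ endpoint)
            snapshot = tracemalloc.take_snapshot()
            # Save the snapshot to a file
            now = datetime.now()
            # Millisecond-resolution timestamp, e.g. 20240512180542633
            timestamp_str = now.strftime('%Y%m%d%H%M%S%f')[:-3]
            snapshot_file = f'tracemalloc-{timestamp_str}'
            snapshot.dump(snapshot_file)
            # Also write out the largest gc-tracked objects (show_memory is listed in the gc section)
            self.show_memory(timestamp_str)
            return self.common_success_response('')
        except Exception:
            return self.common_error_response(biz_error_msg=traceback.format_exc())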
Call the endpoints to export memory snapshots
curl 'http://127.0.0.1:12080/api/mem_check/start/'
curl 'http://127.0.0.1:12080/api/mem_check/save/'
curl 'http://127.0.0.1:12080/api/mem_check/stop/'
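The handlers behind the `/start/` and `/stop/` endpoints are not listed in this note. As a rough sketch, assuming they simply wrap `tracemalloc.start()` and `tracemalloc.stop()`, the three calls map onto the helpers below. The helper names, the frame depth of 10 and the demo file name are assumptions, not the project's actual code.

```python
import os
import tempfile
import tracemalloc


def start_tracing(nframes=10):
    """Begin recording allocations, keeping up to `nframes` frames per block."""
    if not tracemalloc.is_tracing():
        tracemalloc.start(nframes)


def save_snapshot(path):
    """Dump the current snapshot to `path`; tracing must already be on."""
    if not tracemalloc.is_tracing():
        raise RuntimeError("call start_tracing() before save_snapshot()")
    tracemalloc.take_snapshot().dump(path)


def stop_tracing():
    """Stop recording; this also discards the traces collected so far."""
    tracemalloc.stop()


# Usage mirroring the three curl calls: start, save, stop.
start_tracing()
payload = [bytes(1000) for _ in range(100)]  # something for tracemalloc to record
snapshot_path = os.path.join(tempfile.mkdtemp(), "tracemalloc-demo")
save_snapshot(snapshot_path)
stop_tracing()

# A dumped snapshot can be loaded later, even in another process.
loaded = tracemalloc.Snapshot.load(snapshot_path)
print("traces recorded:", len(loaded.traces) > 0)
```

Only allocations made after tracing starts are recorded, so the start call should come well before the first snapshot, and at least two snapshots are needed for a comparison.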
Compare the snapshots
import tracemalloc
snapshot_file1 = 'tracemalloc-20240512180542633'
loaded_snapshot1 = tracemalloc.Snapshot.load(snapshot_file1)
snapshot_file2 = 'tracemalloc-20240513010654228'
loaded_snapshot2 = tracemalloc.Snapshot.load(snapshot_file2)
# Compare the newer snapshot with the older one and print the 10 source lines whose allocated size changed the most
top_stats = loaded_snapshot2.compare_to(loaded_snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
Comparison result
python3 mem_compare.py
/usr/local/lib/python3.6/threading.py:846: size=185 KiB (+185 KiB), count=164 (+164), average=1156 B
/usr/local/lib/python3.6/urllib/request.py:1377: size=174 KiB (+151 KiB), count=267 (+229), average=667 B
/usr/local/lib/python3.6/http/client.py:956: size=94.9 KiB (+72.3 KiB), count=150 (+112), average=648 B
/usr/local/lib/python3.6/threading.py:884: size=64.8 KiB (+64.8 KiB), count=79 (+79), average=840 B
/usr/local/lib/python3.6/_sitebuiltins.py:57: size=25.6 KiB (+25.6 KiB), count=222 (+222), average=118 B
/usr/local/lib/python3.6/_weakrefset.py:84: size=120 KiB (+25.4 KiB), count=1095 (+295), average=112 B
/usr/local/lib/python3.6/http/client.py:1286: size=38.9 KiB (+24.2 KiB), count=69 (+41), average=577 B
/usr/local/lib/python3.6/_weakrefset.py:37: size=24.8 KiB (+23.7 KiB), count=192 (+184), average=132 B
/var/www/yacos/common/sdk/tracing/aiohttp/middleware.py:31: size=22.7 KiB (+21.7 KiB), count=28 (+26), average=829 B
/usr/local/lib/python3.6/_weakrefset.py:48: size=21.0 KiB (+20.1 KiB), count=96 (+92), average=224 B
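Most entries in this result are in standard-library files (threading.py, urllib/request.py, http/client.py), which shows where the memory was allocated but not which application code triggered it. Grouping by `'traceback'` instead of `'lineno'` keeps the call chain, provided tracemalloc was started with more than one frame. The following is a self-contained sketch of that technique; the `allocate` function and the `leaked` list are made up to simulate a leak.

```python
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames per allocation
before = tracemalloc.take_snapshot()

leaked = []


def allocate():
    # Simulated leak: the buffers stay referenced from the module-level list.
    leaked.append(bytearray(100_000))


for _ in range(50):
    allocate()

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Exclude the memory used by tracemalloc itself from both snapshots.
filters = [tracemalloc.Filter(False, tracemalloc.__file__)]
before = before.filter_traces(filters)
after = after.filter_traces(filters)

# Entries are ordered by the absolute size difference, largest first.
top = after.compare_to(before, "traceback")[0]
print(f"+{top.size_diff / 1024:.1f} KiB in {top.count_diff} new blocks")
for line in top.traceback.format():
    print(line)
```

The same `compare_to(..., 'traceback')` call works on snapshots loaded from files, but only if the frames were recorded when the snapshots were taken.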
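2) Use py-spy
py-spy is a sampling profiler for Python programs. It samples a program while it is running, without modifying the source code or restarting the process, and helps developers understand and optimize the program's performance.
Usage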
py-spy record -o profile.svg --pid 199046
py-spy top --pid 199005
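3) Use the gc library
gc (garbage collector) is a module in the Python standard library that exposes the interface to the garbage collector. With it you can enable or disable garbage collection, tune how often collection runs, and output debugging information.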
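As a small illustration of those interfaces, the sketch below checks whether collection is enabled, reads the current thresholds, and forces a collection of a reference cycle. The `Node` class is a made-up example, not part of the service.

```python
import gc


class Node:
    """Tiny object used to build a reference cycle."""

    def __init__(self):
        self.other = None


print("enabled:", gc.isenabled())
print("thresholds:", gc.get_threshold())

gc.collect()   # start from a clean state
gc.disable()   # keep the cycle below from being collected automatically
a, b = Node(), Node()
a.other, b.other = b, a   # a and b now reference each other
del a, b                  # unreachable, but the reference counts never drop to zero
unreachable = gc.collect()  # returns the number of unreachable objects found
gc.enable()
print("unreachable objects found:", unreachable)
```

`gc.set_threshold()` changes how often automatic collection runs, and `gc.set_debug(gc.DEBUG_STATS)` prints collection statistics to stderr.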
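Write out the objects that occupy the most memory
import gc
import sys

# Method of the view class shown in the tracemalloc section; called as self.show_memory(timestamp_str)
def show_memory(self, timestamp_str):
    object_list = []
    for obj in gc.get_objects():
        # sys.getsizeof is shallow: it does not include the objects a container refers to
        size = sys.getsizeof(obj)
        object_list.append((obj, size))
    sorted_values = sorted(object_list,
                           key=lambda x: x[1],
                           reverse=True)
    file_path = f"gc-{timestamp_str}.txt"
    with open(file_path, 'w', encoding='utf-8') as f:
        # Write the 10 largest objects; LEN is the total number of gc-tracked objects
        for obj, size in sorted_values[:10]:
            f.write(f"OBJ: {id(obj)}, "
                    f"TYPE: {type(obj)}, "
                    f"SIZE: {size / 1024 / 1024:.2f} MB, "
                    f"REPR: {str(obj)[:100]}, "
                    f"LEN: {len(sorted_values)}\n")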
Corresponding result
cat gc-20240701021542039.txt
OBJ: 139989418379208, TYPE: <class 'list'>, SIZE: 0.13 MB, REPR: [(0, 64, (('/usr/local/lib/python3.6/typing.py', 1231),)), (0, 64, (('/usr/local/lib/python3.6/typin, LEN: 198677
OBJ: 139991870000296, TYPE: <class 'dict'>, SIZE: 0.07 MB, REPR: {139991878254560: <weakref at 0x7f5265aee638; to 'type' at 0x7f52662cbfe0 (type)>, 139991878271072: , LEN: 198677
OBJ: 139991870197696, TYPE: <class 'dict'>, SIZE: 0.07 MB, REPR: {'builtins': <module 'builtins' (built-in)>, 'sys': <module 'sys' (built-in)>, '_frozen_importlib': , LEN: 198677
OBJ: 139991868766408, TYPE: <class 'list'>, SIZE: 0.05 MB, REPR: ['# module pyparsing.py\n', '#\n', '# Copyright (c) 2003-2018 Paul T. McGuire\n', '#\n', '# Permiss, LEN: 198677
OBJ: 139991848963168, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'application/octet-stream': ['.a', '.bin', '.dll', '.exe', '.o', '.obj', '.so', '.deploy', '.msu', , LEN: 198677
OBJ: 139991829458088, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'__name__': 'microservices.ecs.route', '__doc__': None, '__package__': 'microservices.ecs', '__load, LEN: 198677
OBJ: 139991730020424, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'CRYPTOGRAPHY_PACKAGE_VERSION': <cdata 'char *' 0x7f525d42c000>, 'Cryptography_HAS_EC2M': 1, 'Crypt, LEN: 198677
OBJ: 139991729969912, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'__name__': 'lib', '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, '_or, LEN: 198677
OBJ: 139991783838416, TYPE: <class 'dict'>, SIZE: 0.04 MB, REPR: {'/debug/log/{mark:\\d+}/': <class 'microservices.ecs.api.debug.DebugLogQuery'>, '/logws': <class 'm, LEN: 198677
OBJ: 139991845320744, TYPE: <class 'set'>, SIZE: 0.03 MB, REPR: {<weakref at 0x7f51d3b00b88; to 'ABCMeta' at 0x558e29988058 (VmOsCreateView)>, <weakref at 0x7f51d3a, LEN: 198677
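The largest objects in this output are mostly long-lived module namespaces and lookup tables, and a single snapshot ranked by shallow size does not show what is growing. A complementary approach is to count gc-tracked objects per type at two points in time and look at the difference. The sketch below illustrates the idea; the `Session` class and the helper names are hypothetical.

```python
import gc
from collections import Counter


def type_counts():
    """Count the live gc-tracked objects per type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def count_growth(before, after, limit=10):
    """Types whose instance count grew, largest increase first."""
    diff = after - before  # Counter subtraction keeps only positive counts
    return diff.most_common(limit)


class Session:
    """Stand-in for an application object that is never released."""


before = type_counts()
leaked = [Session() for _ in range(1000)]  # simulated leak
after = type_counts()

for name, delta in count_growth(before, after):
    print(f"{name}: +{delta}")
```

In the service, the two counts would be taken some hours apart, for example from the same endpoint that saves the tracemalloc snapshot.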