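Background:
Over the past two weeks we repeatedly received reports that the console-server service in the test environment was unstable: APIs returned errors and HTTP 500. Restarting the application restored service, but the errors came back after a day or two. This slowed down integration testing and development for the iteration, so the root cause needed to be located and fixed.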
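Problem analysis:
1. Error messages and stack traces
A quick look shows roughly two kinds of errors. The first kind is an abnormal result caused by bad request parameters, which is easy to fix.
The second kind is a 500 error: the API times out or does not respond.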
{"code":500,"msg":"org.springframework.web.reactive.function.client.WebClientRequestException: finishConnect(..) failed: Connection refused: /10.50.208.131:8817; nested exception is io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.50.208.131:8817","success":false}
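Given how the problem had been worked around before (a restart) and the symptoms, this looked like an OOM.
2. Exception type
The default environment is not connected to log collection or monitoring, so the pod logs were inspected over SSH. They showed a heap overflow, java.lang.OutOfMemoryError: Java heap space, and no other useful information.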
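3. Environment and configuration
No JVM options are configured on 翼集, and the pod memory limit is 8G. Under the JVM's default sizing rule (maximum heap = 1/4 of available memory), Xmx = 2G, with the old generation at roughly 2G * 3/4.
Enter the pod and print the JVM information.
After changing the setting to -Xmx4g and running for one day, the symptoms were unchanged.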
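One way to check the effective heap limit from inside the JVM is sketched below. It is an illustration, not the exact command used during the investigation: the class name HeapInfo is hypothetical, and the heap pool names it prints depend on the garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints the heap limit the running JVM actually uses.
final class HeapInfo {

    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx or the default). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
        // List each heap memory pool (eden, survivor, old generation, ...).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isValid()) {
                MemoryUsage usage = pool.getUsage();
                long maxKiB = usage.getMax() < 0 ? -1 : usage.getMax() / 1024; // -1 means undefined
                System.out.printf("%-25s used=%d KiB max=%d KiB%n",
                        pool.getName(), usage.getUsed() / 1024, maxKiB);
            }
        }
    }
}
```

Running this in the pod before and after changing -Xmx shows whether the new limit actually took effect.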
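4. Memory analysis
4.1 Add a Java agent
The monitoring data clearly shows old-generation memory growing steadily over the course of one day, until the process finally hits OOM.
The jmap output shows a large number of MgrAgent objects in memory, more than 1,000,000, which suggests that MgrAgent is behind the OOM. The next step is to find out how these MgrAgent objects are created and what their GC root reference path is.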
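Three approaches were used to dig further:
Code inspection
A search for references to MgrAgent found 69 usages. The most suspicious are two caches that have an expiry time but no maximum map size. They store agent heartbeats. By the business rules, map.size is unlikely to reach 1,000,000, and the old generation kept growing over a full day, which entries with an expiry time would not explain. This option was ruled out.
Online debugging
With arthas attached, map.size turned out to be only a little over a thousand.
Heap dump analyzed with a tool
Heap memory analysis
The reference path runs through Thread, LinkedBlockingQueue, and org.springframework.aop.interceptor.AsyncExecutionInterceptor$$Lambda$1140. At this point the problem is clear: tasks submitted through Spring's asynchronous execution mechanism pile up in the executor's queue, and the accumulated objects cause the heap OOM.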
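5. Code review
The analysis above points to Spring's asynchronous mechanism, so search the code for async.
For how @Async works, see spring.io/guides/gs/async-method/: if you do not define an Executor bean, Spring creates a SimpleAsyncTaskExecutor and uses it.
Characteristics of SimpleAsyncTaskExecutor:
A new thread is created for every execution
By default there is no concurrency limit or similar control policy
This approach easily leads to performance problems, OOM, and similar issues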
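The thread-per-task behaviour described above can be reproduced with plain java.util.concurrent, without Spring. The sketch below is a stand-in for SimpleAsyncTaskExecutor, not its actual implementation; the class name ThreadPerTaskDemo is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;

// Hypothetical stand-in: an executor that starts one new thread per submitted task.
final class ThreadPerTaskDemo {

    /** Submits `tasks` jobs and returns how many distinct threads were alive at once to run them. */
    static int distinctThreads(int tasks) {
        Executor perTask = r -> new Thread(r).start(); // no pooling, no limit
        Set<Long> ids = ConcurrentHashMap.newKeySet();
        CountDownLatch started = new CountDownLatch(tasks);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            perTask.execute(() -> {
                ids.add(Thread.currentThread().getId());
                started.countDown();
                try {
                    release.await(); // keep the thread alive until every task has started
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            started.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("threads used for 50 tasks: " + distinctThreads(50));
    }
}
```

Every submission costs a thread, so a burst of slow tasks translates directly into a burst of live threads.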
------------------------------------------------- If Spring Boot is used:
AsyncExecutionInterceptor
@Nullable
public Object invoke(MethodInvocation invocation) throws Throwable {
    Class<?> targetClass = invocation.getThis() != null ? AopUtils.getTargetClass(invocation.getThis()) : null;
    Method specificMethod = ClassUtils.getMostSpecificMethod(invocation.getMethod(), targetClass);
    Method userDeclaredMethod = BridgeMethodResolver.findBridgedMethod(specificMethod);
    AsyncTaskExecutor executor = this.determineAsyncExecutor(userDeclaredMethod);
    ......
}
TaskExecutionAutoConfiguration.class
@Bean
@ConditionalOnMissingBean
public TaskExecutorBuilder taskExecutorBuilder(TaskExecutionProperties properties, ObjectProvider<TaskExecutorCustomizer> taskExecutorCustomizers, ObjectProvider<TaskDecorator> taskDecorator) {
    TaskExecutionProperties.Pool pool = properties.getPool();
    TaskExecutorBuilder builder = new TaskExecutorBuilder();
    builder = builder.queueCapacity(pool.getQueueCapacity());
    builder = builder.corePoolSize(pool.getCoreSize());
    builder = builder.maxPoolSize(pool.getMaxSize());
    builder = builder.allowCoreThreadTimeOut(pool.isAllowCoreThreadTimeout());
    builder = builder.keepAlive(pool.getKeepAlive());
    TaskExecutionProperties.Shutdown shutdown = properties.getShutdown();
    builder = builder.awaitTermination(shutdown.isAwaitTermination());
    builder = builder.awaitTerminationPeriod(shutdown.getAwaitTerminationPeriod());
    builder = builder.threadNamePrefix(properties.getThreadNamePrefix());
    Stream var10001 = taskExecutorCustomizers.orderedStream();
    var10001.getClass();
    builder = builder.customizers(var10001::iterator);
    builder = builder.taskDecorator((TaskDecorator)taskDecorator.getIfUnique());
    return builder;
}

@Lazy
@Bean(
    name = {"applicationTaskExecutor", "taskExecutor"}
)
@ConditionalOnMissingBean({Executor.class})
public ThreadPoolTaskExecutor applicationTaskExecutor(TaskExecutorBuilder builder) {
    return builder.build();
}
public static class Pool {
    private int queueCapacity = Integer.MAX_VALUE;
    private int coreSize = 8;
    private int maxSize = Integer.MAX_VALUE;
    private boolean allowCoreThreadTimeout = true;
    private Duration keepAlive = Duration.ofSeconds(60L);
    ...
}
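This task thread pool is somewhat better than SimpleAsyncTaskExecutor: the core pool size is 8 and the blocking queue capacity is Integer.MAX_VALUE. Because the queue effectively never fills, the pool never grows beyond its 8 core threads, so when many tasks are submitted they back up in the queue and can cause an OOM.
A demo was written to verify this hypothesis, and the debug results confirm the code analysis above.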
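A demo along these lines can be sketched with a plain ThreadPoolExecutor configured with the same numbers as the Pool defaults shown above (core 8, max Integer.MAX_VALUE, unbounded LinkedBlockingQueue). This is not the original demo: the class name UnboundedQueueDemo is hypothetical, and the byte[] payload is only a stand-in for the MgrAgent objects each task holds on to.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: slow tasks on a pool with an unbounded queue.
final class UnboundedQueueDemo {

    /** Returns {poolSize, queuedTasks} after submitting `tasks` jobs that all block. */
    static int[] submitBlocked(int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>()); // no-arg constructor: capacity Integer.MAX_VALUE
        try {
            for (int i = 0; i < tasks; i++) {
                byte[] payload = new byte[1024]; // stand-in for an MgrAgent captured by the task
                pool.execute(() -> {
                    try {
                        release.await(); // simulate a task that runs for a long time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    if (payload.length == 0) {
                        throw new IllegalStateException("unreachable");
                    }
                });
            }
            // Every queued task keeps its payload reachable until a worker picks it up.
            return new int[] {pool.getPoolSize(), pool.getQueue().size()};
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] r = submitBlocked(1000);
        System.out.println("pool size=" + r[0] + ", queued=" + r[1]);
    }
}
```

With 1000 blocked submissions the pool stays at 8 threads and the remaining 992 tasks sit in the queue, which matches the Thread to LinkedBlockingQueue reference path seen in the heap dump.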
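6. Memory configuration
No adjustment is needed for now. For container deployments, setting Xmx explicitly is recommended.
7. Solution
The analysis shows that @Async must not rely on the default SimpleAsyncTaskExecutor, and the thread pool that Spring Boot auto-configures does not meet the requirements either. Use a custom thread pool that bounds concurrency, queue size, and related policies to avoid the potential OOM risk. (Why the submitted tasks run for so long that their memory cannot be reclaimed is not discussed here.)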
@Bean
public Executor taskExecutor() {
    ThreadPoolTaskExecutor pool = new ThreadPoolTaskExecutor();
    pool.setCorePoolSize(5);    // core pool size
    pool.setMaxPoolSize(10);    // maximum number of threads in the pool
    pool.setQueueCapacity(50);  // bounded queue: new tasks are rejected once pool and queue are full
    pool.setWaitForTasksToCompleteOnShutdown(true);
    pool.setThreadNamePrefix("async-executor-setserver");
    return pool;
}
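The effect of these bounds can be checked with a plain ThreadPoolExecutor using the same numbers (core 5, max 10, queue 50). The sketch below assumes the default rejection behaviour (AbortPolicy) and a hypothetical class name BoundedPoolDemo; it is not the Spring bean itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: slow tasks on a pool with bounded threads and a bounded queue.
final class BoundedPoolDemo {

    /** Returns {accepted, rejected, poolSize} after submitting `tasks` jobs that all block. */
    static int[] submitBlocked(int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                5, 10, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(50)); // default handler is AbortPolicy
        int accepted = 0;
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a long-running task
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                    accepted++;
                } catch (RejectedExecutionException e) {
                    rejected++; // pool at max size and queue full
                }
            }
            return new int[] {accepted, rejected, pool.getPoolSize()};
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] r = submitBlocked(100);
        System.out.println("accepted=" + r[0] + ", rejected=" + r[1] + ", pool size=" + r[2]);
    }
}
```

Of 100 blocked submissions, 60 are accepted (5 core threads, 50 queued, 5 extra threads) and 40 are rejected, so memory stays bounded. The trade-off is that callers must be prepared for rejection, for example by handling the exception or choosing a different rejection policy.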
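8. Prevention
Adopt monitoring tools such as a Java agent so that problems are exposed, analyzed, and resolved early
Define clear conventions for using threads and thread pools, and apply them
Add a code-convention plugin to IDEA for automated checks and early prevention, so that simple problems do not reach production and increase operations cost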