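Problem
This morning an alert fired for a scheduled task: the Flink jobs had failed. Logging in to the Flink web UI showed that all jobs were gone and the slot count was 0.
The Flink log contained the following error: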
2022-12-07 08:00:05,444 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_191]
at java.lang.ClassLoader.defineClass(ClassLoader.java:763) ~[?:1.8.0_191]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_191]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_191]
at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_191]
at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_191]
at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_191]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_191]
at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.13.6.jar:1.13.6]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) [?:1.8.0_191]
at java.lang.invoke.MethodHandleNatives.resolve(Native Method) ~[?:1.8.0_191]
at java.lang.invoke.MemberName$Factory.resolve(MemberName.java:975) [?:1.8.0_191]
at java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1000) [?:1.8.0_191]
at java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:1394) [?:1.8.0_191]
at java.lang.invoke.MethodHandles$Lookup.linkMethodHandleConstant(MethodHandles.java:1750) [?:1.8.0_191]
at java.lang.invoke.MethodHandleNatives.linkMethodHandleConstant(MethodHandleNatives.java:477) [?:1.8.0_191]
at org.apache.poi.xssf.usermodel.XSSFRelation.<clinit>(XSSFRelation.java:124) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.poi.xssf.usermodel.XSSFWorkbookType.<clinit>(XSSFWorkbookType.java:26) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:247) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.WorkbookUtil.createBook(WorkbookUtil.java:133) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.WorkbookUtil.createBookForWriter(WorkbookUtil.java:73) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelWriter.<init>(ExcelWriter.java:145) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelWriter.<init>(ExcelWriter.java:135) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelUtil.getWriter(ExcelUtil.java:418) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at com.ucloud.provider.flink.imsafe.job.FlinkJobCheatFind.operateResult(FlinkJobCheatFind.java:276) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at com.ucloud.provider.flink.imsafe.job.FlinkJobCheatFind$2.close(FlinkJobCheatFind.java:149) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:247) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.12-1.13.6.jar:1.13.6]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
2022-12-07 08:00:05,445 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2022-12-07 08:00:05,445 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2022-12-07 08:00:05,446 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-4b019f58-a9a8-49ce-9429-41054270ef41
2022-12-07 08:00:06,134 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
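Analysis
The exception makes it clear that the failure was caused by a metaspace out-of-memory error.
The log shows that it was triggered by a batch job (FlinkJobCheatFind), which is scheduled by a central task scheduler.
So we know roughly where the problem is, but why would memory leak there?
First we need to understand what metaspace is. When writing Java programs, the OOM we usually run into is heap space; metaspace OOM is rare. This comes down to the Java runtime memory model, which includes both metaspace and heap. Metaspace (which replaced the permanent generation, PermGen, in JDK 1.8) stores class metadata, that is, relatively fixed information about loaded classes. The heap stores the objects created from those classes and is the much larger region. Since JDK 7 the values of static fields also live on the heap, but they stay reachable for as long as their class and its class loader are alive. The two regions are also garbage-collected differently.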
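To make the heap versus metaspace distinction concrete, here is a minimal sketch that lists the JVM memory pools through the standard `java.lang.management` API. The class name `MemoryPools` and the helper `usedBytes` are made up for this illustration, and the pool name "Metaspace" is what HotSpot JVMs from JDK 8 onward report; other JVMs may name it differently.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class MemoryPools {
    /** Returns the used bytes of the first pool whose name contains the given text, or -1 if none matches. */
    static long usedBytes(String poolNamePart) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            if (usage != null && pool.getName().contains(poolNamePart)) {
                return usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Print every pool with its type, so heap pools and non-heap pools (such as Metaspace) can be compared.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue;
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            System.out.printf("%-35s %-8s used=%d KB%n", pool.getName(), kind, usage.getUsed() / 1024);
        }
    }
}
```

Running this inside a long-lived process after each job run is one rough way to see whether the non-heap "Metaspace" pool keeps growing.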
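Now we know what metaspace is, but we still do not know why it overflowed. Looking at the log again, the batch job uses POI's Excel utilities to process Excel files. These utility classes carry a large amount of static data, and the class loader that loads both the classes and their static data is ChildFirstClassLoader. The ChildFirstClassLoader instances loaded too many classes and too much static data, and that produced the leak.
**So why would ChildFirstClassLoader cause a memory leak?** To answer that, we need to understand how Flink's class loaders work: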
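Flink has two class loading strategies, Parent-First and Child-First:
Parent-First: similar to Java's parent delegation model. The Parent-First ClassLoader is in practice just a URLClassLoader.
Child-First: also built on URLClassLoader, but it first matches the class name against the list formed by concatenating classloader.parent-first-patterns.default and classloader.parent-first-patterns.additional. If the class name prefix matches, loading goes through parent delegation first; otherwise the class is loaded by the ChildFirstClassLoader itself.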
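The child-first lookup order described above can be sketched as follows. This is a simplified illustration, not Flink's actual ChildFirstClassLoader source; the class name `ChildFirstSketch` and the pattern list used in `main` are assumptions chosen for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.List;

/** Simplified child-first loader for illustration only. */
class ChildFirstSketch extends URLClassLoader {
    private final List<String> parentFirstPatterns;

    ChildFirstSketch(URL[] urls, ClassLoader parent, List<String> parentFirstPatterns) {
        super(urls, parent);
        this.parentFirstPatterns = parentFirstPatterns;
    }

    /** True if the class name starts with one of the parent-first prefixes. */
    boolean isParentFirst(String name) {
        for (String prefix : parentFirstPatterns) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (isParentFirst(name)) {
                    // Prefix matched: use normal parent delegation.
                    return super.loadClass(name, resolve);
                }
                try {
                    // Child first: look in this loader's own URLs before asking the parent.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in the child's URLs: fall back to the parent.
                    return super.loadClass(name, resolve);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> patterns = Arrays.asList("java.", "org.apache.flink.");
        try (ChildFirstSketch loader =
                new ChildFirstSketch(new URL[0], ChildFirstSketch.class.getClassLoader(), patterns)) {
            System.out.println(loader.isParentFirst("java.lang.String"));                            // true
            System.out.println(loader.isParentFirst("org.apache.poi.xssf.usermodel.XSSFWorkbook")); // false
            System.out.println(loader.loadClass("java.lang.String") == String.class);               // true
        }
    }
}
```

Every class that takes the child-first branch is defined by this loader instance, so its metadata can only be unloaded once the loader itself becomes unreachable.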
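Child-First is the default. In standalone mode, every run of a batch job creates a new ChildFirstClassLoader that loads all of the job's classes, and when the job finishes Flink closes and releases that loader. This is where the problem lies: closing the class loader only calls URLClassLoader's close method, which merely closes the opened jar files. The classes that were already loaded remain, and as long as anything else still references the ClassLoader, the ChildFirstClassLoader cannot be garbage-collected. After the batch job has run enough times, metaspace runs out of memory.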
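We now understand the mechanism and roughly where the overflow happens, but what exactly is not being released and keeps the ChildFirstClassLoader alive?
That can only be found by analysing a real run. We ran the batch job several times in the test environment and watched metaspace in the web UI: it kept growing and was never reclaimed. We then dumped the TaskManager process with jmap and opened the dump in Memory Analyzer.
In Leak Suspects, this ChildFirstClassLoader accounts for 4,408,232 bytes, which is clearly abnormal.
Searching for this class in the Histogram view shows 3 instances that stay alive.
To find out what is holding them, we follow the references.
The reference chain shows that the thread mysql-cj-abandoned-connection-cleanup holds this loader.
Now the cause is clear: the batch job uses a database connection, the MySQL driver starts a cleanup thread, and that thread prevents the ClassLoader from ever being released.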
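One common way a background thread pins a class loader is through its context class loader, which a new thread inherits from the thread that creates it. The sketch below demonstrates only that inheritance with plain JDK classes. The names `LingeringThreadDemo` and `demo-cleanup-thread` are made up, and this is not a claim about exactly which reference the MySQL cleanup thread holds in the heap dump above.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.CountDownLatch;

class LingeringThreadDemo {
    /** Starts a daemon thread while `loader` is the caller's context class loader, then restores the original. */
    static Thread startBackgroundThread(ClassLoader loader, CountDownLatch stop) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            // The new thread inherits the creator's context class loader at construction time.
            Thread t = new Thread(() -> {
                try {
                    stop.await(); // keep running until told to stop, like a cleanup thread
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }, "demo-cleanup-thread");
            t.setDaemon(true);
            t.start();
            return t;
        } finally {
            current.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader userLoader = new URLClassLoader(new URL[0], LingeringThreadDemo.class.getClassLoader());
        CountDownLatch stop = new CountDownLatch(1);
        Thread t = startBackgroundThread(userLoader, stop);

        userLoader.close(); // closes jar handles only; the loader object itself is still referenced
        System.out.println(t.getContextClassLoader() == userLoader); // true

        stop.countDown(); // only once the thread ends can the loader become unreachable
        t.join();
    }
}
```

As long as such a thread is alive, the loader and every class it defined remain reachable, which matches the growth pattern seen in the dump.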
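Solution
Flink provides a release hook: you can register an action that runs when the user code ClassLoader is released, so cleanup can be done before the loader goes away.
We can use this hook to do the cleanup:
(getRuntimeContext() is available in rich classes such as RichInputFormat, RichSinkFunction and RichOutputFormat.)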
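import com.mysql.cj.jdbc.AbandonedConnectionCleanupThread;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;
import org.apache.flink.api.common.functions.RuntimeContext;

// Place the following inside the rich function, for example in open().
log.info("Registering a release hook for references that are not released automatically");
RuntimeContext ctx = this.getRuntimeContext();
ctx.registerUserCodeClassLoaderReleaseHookIfAbsent(FlinkJobCheatFind.class.getName() + "_clsreleasehook", new Runnable() {
    @Override
    public void run() {
        log.info("release hook");
        // Deregister the JDBC drivers visible to this class loader.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            try {
                log.info("Deregistering driver: {}", driver);
                DriverManager.deregisterDriver(driver);
            } catch (SQLException e) {
                log.error("Failed to deregister driver {}", driver, e);
            }
        }
        // Stop the MySQL abandoned-connection cleanup thread.
        log.info("Shutting down the MySQL cleanup thread");
        AbandonedConnectionCleanupThread.uncheckedShutdown();
    }
});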
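The main purpose of this hook is to deregister the drivers and shut down the cleanup thread before the class loader is released, so that nothing else still references the ChildFirstClassLoader.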
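A possible refinement, offered as a sketch and not as what the original job does: deregister only the drivers whose class was loaded by the user code class loader, so that drivers living in Flink's lib/ folder stay registered. The class `DriverCleanup` and its method name are hypothetical; it uses only `java.sql` APIs.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;

class DriverCleanup {
    /** Deregisters only drivers whose class was loaded by the given loader; returns how many were removed. */
    static int deregisterDriversLoadedBy(ClassLoader userCodeLoader) {
        int removed = 0;
        // Copy the enumeration first so deregistration does not interfere with iteration.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == userCodeLoader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    System.err.println("Failed to deregister " + driver + ": " + e);
                }
            }
        }
        return removed;
    }
}
```

Inside the release hook this could be called with the loader returned by getRuntimeContext().getUserCodeClassLoader(). Note that DriverManager only exposes drivers visible to the caller's class loader, so the helper should itself be part of the job jar.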
The official documentation also gives some guidance on what needs to be released manually:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/debugging/debugging_classloading/
The key passage is:
Common causes for class leaks and suggested fixes:
Lingering Threads: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.
Interners: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava’s interners, or Avro’s class/object caches in the serializers.
JDBC: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should either add the driver jars to Flink’s lib/ folder, or add the driver classes to the list of parent-first loaded class via classloader.parent-first-patterns-additional.
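Verification
Run the batch job repeatedly and watch metaspace usage.
I used arthas to inspect memory here. Metaspace dropped from 113M to 79M, which shows that it is now being reclaimed.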
Article source: https://www.toymoban.com/news/detail-478534.html
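Other options
The solution above is still somewhat cumbersome: if the project has many dependencies, it is hard to work out exactly what has to be released.
There are two other options:
1. Submit the job through YARN. With YARN, temporary containers are started for the job and task processes, and the whole process is destroyed afterwards, so the problem does not arise.
2. Put the dependency jars into Flink's lib directory so that the job jar contains only a few basic classes. That way there are no extra classes to load on every launch.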