Flink中遇到的问题

这篇具有很好参考价值的文章主要介绍了Flink中遇到的问题。希望对大家有所帮助。如果存在错误或未考虑完全的地方,请大家不吝赐教,您也可以点击"举报违法"按钮提交疑问。

目录

1、提交flink 批处理任务时遇到的问题

2、flink定时任务,mysql连接超时问题

3、yarn 增加并行任务数量配置

4、flink checkpoint 恢复失败 

5、flink程序在hadoop集群跑了一段时间莫名挂掉 

1、提交flink 批处理任务时遇到的问题

问题描述:

最近写了一个flink批处理程序, 目的是读取hdfs文件,将文件数据写入到hbase。
项目是在idea中开发的 pom 文件如下:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hdfs-flink</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <flink.version>1.11.2</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>2.0.0-alpha1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>1.11.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.18</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.17</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.2.6</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!--  指定模块打包-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <artifactSet>
                        <excludes>
                            <exclude>com.google.code.findbugs:jsr305</exclude>
                            <exclude>org.slf4j:*</exclude>
                            <exclude>log4j:*</exclude>
                        </excludes>
                    </artifactSet>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <!--需要加入编译信息过滤,不然不知名错误-->
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                                <exclude>META-INF/*.DSA</exclude>

                            </excludes>
                        </filter>
                    </filters>

                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass>xxx.Format</mainClass>
                        </transformer>
                    </transformers>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

将代码写完之后发现在idea中运行是没有问题,但是将项目打包后上传到集群,使用Job方式提交任务的时候报错,控制台的详细信息如下:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/flink/flink-1.11.2/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/hadoop-3.1.4/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2023-02-24 13:52:11,153 WARN  org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The configuration directory ('/usr/lib/flink/flink-1.11.2/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file.
2023-02-24 13:52:11,179 INFO  org.apache.hadoop.yarn.client.RMProxy                        [] - Connecting to ResourceManager at hadoop01/192.168.1.82:8032
2023-02-24 13:52:11,306 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2023-02-24 13:52:11,372 INFO  org.apache.hadoop.conf.Configuration                         [] - resource-types.xml not found
2023-02-24 13:52:11,372 INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to find 'resource-types.xml'.
2023-02-24 13:52:11,410 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1024, slotsPerTaskManager=1}
2023-02-24 13:52:14,619 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Submitting application master application_1614083574571_0023
2023-02-24 13:52:14,653 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Submitted application application_1614083574571_0023
2023-02-24 13:52:14,653 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Waiting for the cluster to be allocated
2023-02-24 13:52:14,654 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Deploying cluster, current state ACCEPTED
org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy Yarn job cluster.
 at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:431)
 at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
 at org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:973)
 at org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:124)
 at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:72)
 at com.spaceon.hys.argos.set.Format.main(Format.java:66)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
 at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
 at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
 at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
 at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
 at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
 at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 
Diagnostics from YARN: Application application_1614083574571_0023 failed 4 times in previous 10000 milliseconds due to AM Container for appattempt_1614083574571_0023_000004 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2023-02-24 13:52:25.722]Exception from container-launch.
Container id: container_1614083574571_0023_04_000001
Exit code: 1

[2023-02-24 13:52:25.723]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

[2023-02-24 13:52:25.724]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

For more detailed output, check the application tracking page: http://hadoop01:8088/cluster/app/application_1614083574571_0023 Then click on links to logs of each attempt.
. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1614083574571_0023
 at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1021)
 at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:524)
 at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:424)
 ... 21 more
2023-02-24 13:52:26,086 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Cancelling deployment from Deployment Failure Hook
2023-02-24 13:52:26,087 INFO  org.apache.hadoop.yarn.client.RMProxy                        [] - Connecting to ResourceManager at hadoop01/192.168.1.82:8032
2023-02-24 13:52:26,088 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Killing YARN application
2023-02-24 13:52:26,097 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Killed application application_1614083574571_0023
2023-02-24 13:52:26,097 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Deleting files in hdfs://hadoop01:9000/user/root/.flink/application_1614083574571_0023.

原因分析:

1. 从控制台的日志好像发现不了什么关键的问题,只是报了一个 Container 加载失败,然后让你去yarn的首页去查看日志,之后我也查看了相关日志信息,但都没有发现什么有价值的信息。

2. 于是考虑是不是自己jar包的问题,之后使用各种打包的方式都没能解决问题。

3. 最后就只能逐个排除出问题的jar包,最后发现去掉 hbase-client 的包后程序正常执行,经过查看hbase-client 的依赖发现 hbase-client 包里使用的hadoop相关的依赖都是2.8.x,但是我的hadoop集群是3.1.4 版本。于是判断可能是hadoop版本冲突了。


解决方案:
去除hbase-client的包里 hadoop 相关依赖,其中有些包在 hadoop-client 中是包含的,但是不能覆盖hbase-client里的,通过 maven-shade-plugin打包也是不能解决的


具体操作在 pom 文件修改hbase-client 依赖:

<dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>2.4.1</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-annotations</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-auth</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-core</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
            </exclusion>
        </exclusions>
</dependency>

2、flink定时任务,mysql连接超时问题

问题描述:
前一天将flink任务提交到集群,第二天在web界面发现如下异常:

2023-02-25 09:11:40
com.mysql.cj.jdbc.exceptions.CommunicationsException: The last packet successfully received from the server was 63,745,201 milliseconds ago.  The last packet sent successfully to the server was 63,745,201 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
    at com.mysql.cj.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:1412)
    at com.mysql.cj.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:1390)
    at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:403)
    at com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:796)
    at com.spaceon.hys.quality.function.sink.SinkQualityResult.invoke(SinkQualityResult.java:74)
    at com.spaceon.hys.quality.function.sink.SinkQualityResult.invoke(SinkQualityResult.java:22)
    at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:161)
    at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:178)
    at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:153)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: The last packet successfully received from the server was 63,745,201 milliseconds ago.  The last packet sent successfully to the server was 63,745,201 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
    at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
    at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
    at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
    at com.mysql.cj.protocol.a.NativeProtocol.send(NativeProtocol.java:572)
    at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:633)
    at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:130)
    at com.mysql.cj.NativeSession.sendCommand(NativeSession.java:317)
    at com.mysql.cj.NativeSession.queryServerVariable(NativeSession.java:1046)
    at com.mysql.cj.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:1398)
    ... 18 more
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at com.mysql.cj.protocol.a.SimplePacketSender.send(SimplePacketSender.java:55)
    at com.mysql.cj.protocol.a.TimeTrackingPacketSender.send(TimeTrackingPacketSender.java:50)
    at com.mysql.cj.protocol.a.NativeProtocol.send(NativeProtocol.java:563)
    ... 23 more

原因分析:

因为在flink任务里有一个按天统计功能,通过flink的定时器完成每天输出一个统计结果,但是在初始化mysql连接的后,也就是启动flink任务后,过24小时才会有该链接的更新操作,但mysql默认的连接等待时长为8小时,在建立连接后过8小时候没有进行任何查询操作,mysql将自动关闭连接,这才导致mysql连接失效


解决方案:
网上的解决方案大致两种
1. 更新mysql的配置,重新设置连接的等待时间

Flink中遇到的问题

 2. 连接加&autoReconnect=true属性(mysql5.x无效,亲测)

我是通过代码层来重新建立连接的

因为不想再去动mysql的默认配置,直接catch CommunicationsException ,若抛此异常表示连接断开了,然后重新建立连接

try {
    execUpdate(value);
    System.out.println("mysql 连接未断开更新");
}catch (CommunicationsException e){
    connection = ConnectionUtil.getConnection();
    execUpdate(value);
    System.out.println("mysql 连接断开重连更新");
}

3、yarn 增加并行任务数量配置

问题描述:
由于项目的要求,需要在yarn集群运行三个以上的flink流处理程序,外加一部分批处理程序,今天测试的时候发现部署了三个flink流处理程序后,内存,cpu资源都没有占满还有大部分的空间,但是提交flink批处理程序的时候,批处理程序一直处于挂起状态,不执行,大概想了一下应该就是资源分配的问题


解决方案:
修改hadoop配置文件:capacity-scheduler.xml
yarn.scheduler.capacity.maximum-am-resource-percent

<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    #该值默认为0.1,将其修改为0.2
    <value>0.2</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
</property>

该值的描述信息意为,群集中可用于运行应用程序主机的资源的最大百分比,即控制并发运行的应用程序的数量。

根据应用场景适当的增大该值即可提高应用的并发数量。

4、flink checkpoint 恢复失败 

异常信息:

org.apache.flink.util.StateMigrationException: For heap backends, the new state serializer must not be incompatible.
 at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend.tryRegisterStateTable(HeapKeyedStateBackend.java:230) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend.createInternalState(HeapKeyedStateBackend.java:273) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.state.KeyedStateFactory.createInternalState(KeyedStateFactory.java:47) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.state.ttl.TtlStateFactory.createStateAndWrapWithTtlIfEnabled(TtlStateFactory.java:72) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.state.AbstractKeyedStateBackend.getOrCreateKeyedState(AbstractKeyedStateBackend.java:279) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.getOrCreateKeyedState(StreamOperatorStateHandler.java:245) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.api.operators.AbstractStreamOperator.getOrCreateKeyedState(AbstractStreamOperator.java:435) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.open(WindowOperator.java:240) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) ~[preprocessing-flink-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
 at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_65]

解决方案:
先看 cheackpoint 页面信息

Flink中遇到的问题

再查看hdfs,会发现只有最近的一条checkpoint被保留下来,所以恢复的参数必须是最后一条存在的checkpoint记录

/usr/lib/flink/flink/bin/flink run -d -m yarn-cluster -yjm 1024 -ytm 1024 -s hdfs://hadoop01:9000/flink/checkpoints/preprocess/60366092809bbc2b5785591f8014f759/chk-815 /usr/lib/flink/jars/test/xxx.jar

5、flink程序在hadoop集群跑了一段时间莫名挂掉 

问题描述:
一个左右前在hadoop集群跑了一个flink流处理任务,有一天检查的时候发现挂掉了,信息如下:

Flink中遇到的问题

Flink中遇到的问题

发现flink所有的运行日志随hadoop的容器消失了

原因分析:
第一时间想到查看hadoop的日志,大概会有些帮助,进而定位问题,先后查看了namenode日志,datanode日志发现并无异常,然后查看nodemanager的日志时发现了端倪

Flink中遇到的问题

(由于日志文件比较大在linux终端查看不方便所以下载下来,使用Notepad++查看,查看时在hadoop任务界面找到任务失败的时间,下载该时间点的日志文件,就可以找到对应日志信息)
从日志信息可以发现,由于找不到进程id,所以容器直接关掉了退出了,于是根据日志进去查看了linux系统下的tmp目录,最后发现果然有/tmp/hadoop-root/*一系列目录。

对于hadoop容器的进程id这么重要的东西怎么能放在/tmp目录呢,tmp目录众所周知在文件不修改的情况下,默认30天会被清理,那么问题大概就出现在这里了,运行flink任务的container 的进程id被系统清理了

解决方案:
通过分析后,查看了一下hadoop临时目录的配置

Flink中遇到的问题

 可以发现用到该配置的地方竟然有13处之多,当然也包含了container的进程id的信息存放。

因此只需将hadoop.tmp.dir的配置更改一下就可以了,更改时修改core-site.xml文件,添加该配置项,具体的目录根据自己系统空间去配置。文章来源地址https://www.toymoban.com/news/detail-454107.html

<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
</property>

到了这里,关于Flink中遇到的问题的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处: 如若内容造成侵权/违法违规/事实不符,请点击违法举报进行投诉反馈,一经查实,立即删除!

领支付宝红包 赞助服务器费用

相关文章

  • 【各种**问题系列】‘vite’ 不是内部或外部命令,也不是可运行的程序或批处理文件

            尝试使用Vite来构建前端项目,在创建前端项目时,执行启动构建项目时控制台报错出现:         使用vite创建vue3.0项目的时候,vite不会自动 install 相关的依赖,需要我们手动去安装。进入项目的node_modules目录里面查看,什么都没有!!!         所以说出现

    2024年02月11日
    浏览(57)
  • 批处理命令大全 | Windows批处理教程 - ChatGPT

    批处理以.bat或.cmd文件的形式存在,在Windows命令提示符下运行,也可以通过双击批处理文件来运行。批处理文件由一系列命令组成,可以按照顺序执行,也可以根据条件或循环控制选择性地执行。 在Windows上创建一个批处理文件非常简单,在编辑器中输入一系列命令并保存为

    2024年02月04日
    浏览(81)
  • 如何解决 “node --max-old-space-size=10240“ 不是内部或外部命令,也不是可运行的程序或批处理文件。报错问题

      一、FATAL ERROR: Zone Allocation failed - process out of memory      今天早上起来打包vue项目时发现一直打包不了,还报了一个 数据错误:区域分配失败-进程内存不足; 排查之后发现是: node内存溢出的问题。 为了解决这个问题,我运行了2个命令之后就可以打包了 1、首先全局下载

    2024年02月16日
    浏览(38)
  • redis批处理优化

    一个命令在网络传输的时间往往是远大于在redis中执行命令的时间的,如果每条命令都要逐条经历网络传输,耗时将会大大增加,我们不妨将命令多量少次的传输给redis,这样就大大减少了因为网络传输时间,大大提高的效率 2.1.单机模式下的批处理 2.2.集群模式下的批处理 这

    2024年01月19日
    浏览(43)
  • Windows批处理

    @ echo off :关闭命令的回显功能,这样在执行脚本时不会显示每条命令的具体执行过程。建议将此行放在批处理脚本的首行。 rem :用于添加注释,后面可以跟上注释内容。注释的作用是对脚本进行说明或提醒,不会被执行。 pause :暂停批处理的运行,直到用户按下任意键才

    2024年02月07日
    浏览(49)
  • JDBC p4 批处理

    基本介绍: 当需要成批插入或者更新记录时。可以采用Java的批量更新机制,这一机制允许多条语句一次性提交给数据库批量处理。通常情况下比单独提交处理更有效率。 JDBC的批量处理语句包括下面方法: addBatch():添加需要批量处理的SQL语句或参数; executeBatch():执行批量

    2024年02月15日
    浏览(39)
  • 【bat】批处理脚本大全

    目录 1.概述 2.变量 3.运算符 3.2.重定向运算符 3.3.多命名运算符 3.4.管道运算符 4.命令 4.1.基本命令 4.2.参数传递 4.3.查看脚本内容 4.4.注释 4.5.日期和时间 4.6.启动脚本 4.7.调用其他bat 4.8.任务管理 4.8.1.任务列表查看 4.8.2.任务终止 4.9.文件夹 4.10.关机 4.11.环境变量 4.12.目录 4.12.1

    2024年02月04日
    浏览(53)
  • BAT 批处理脚本教程

    第一节 常用批处理内部命令简介 批处理定义:顾名思义,批处理文件是将一系列命令按一定的顺序集合为一个可执行的文本文件,其扩展名为BAT或者CMD。这些命令统称批处理命令。 小知识:可以在键盘上按下Ctrl+C组合键来强行终止一个批处理的执行过程。 了解了大概意思后

    2024年02月02日
    浏览(49)
  • 大数据处理平台的架构演进:从批处理到实时流处理

    🎈个人主页:程序员 小侯 🎐CSDN新晋作者 🎉欢迎 👍点赞✍评论⭐收藏 ✨收录专栏:大数据系列 ✨文章内容:大数据框架演进 🤝希望作者的文章能对你有所帮助,有不足的地方请在评论区留言指正,大家一起学习交流!🤗 大数据处理平台的架构演进经历了从批处理到实

    2024年02月10日
    浏览(45)
  • MyBatis批处理,使用foreach

    测试代码: 批量插入数据的局限性: 无法获取插入数据的id 批量生成的SQL太长,可能会被服务器拒绝 测试代码

    2024年02月07日
    浏览(38)

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

博客赞助

微信扫一扫打赏

请作者喝杯咖啡吧~博客赞助

支付宝扫一扫领取红包,优惠每天领

二维码1

领取红包

二维码2

领红包