Flink中FileSink的使用-Toy模板网

这篇具有很好参考价值的文章主要介绍了Flink中FileSink的使用。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

在Flink中提供了StreamingFileSink用以将数据流输出到文件系统.
这里结合代码介绍如何使用FileSink.
首先FileSink有两种模式forRowFormat和forBulkFormat

    public static <IN> DefaultRowFormatBuilder<IN> forRowFormat(
            final Path basePath, final Encoder<IN> encoder) {
        return new DefaultRowFormatBuilder<>(basePath, encoder, new DateTimeBucketAssigner<>());
    }

    public static <IN> DefaultBulkFormatBuilder<IN> forBulkFormat(
            final Path basePath, final BulkWriter.Factory<IN> bulkWriterFactory) {
        return new DefaultBulkFormatBuilder<>(
                basePath, bulkWriterFactory, new DateTimeBucketAssigner<>());
    }

二者的区别是forRowFormat是一行一行的处理数据,而forBulkFormat则是可以一次处理多条数据,而多条处理的好处就是可以帮助生成列式存储的文件如ParquetFile和ORCFile,而forRowFormat则做不到这点,关于列式存储和行式存储的区别可通过数据存储格式这篇文章简单做一个了解.

下面以forRowFormat作为示例演示一下代码

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.time.Duration;

/**
 * @Author: J
 * @Version: 1.0
 * @CreateTime: 2023/6/27
 * @Description: 测试
 **/
public class FlinkFileSink {
    public static void main(String[] args) throws Exception {
        // 构建流环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 设置并行度为1
        env.setParallelism(1);
        // 这里是生成数据流,CustomizeSource这个类是自定义数据源(为了方便测试)
        DataStreamSource<CustomizeBean> dataStreamSource = env.addSource(new CustomizeSource());
        // 现将数据转换成字符串形式
        SingleOutputStreamOperator<String> map = dataStreamSource.map(bean -> bean.toString());
        // 构造FileSink对象,这里使用的RowFormat,即行处理类型的
        FileSink<String> fileSink = FileSink
                // 配置文件输出路径及编码格式
                .forRowFormat(new Path("/Users/xxx/data/testData/"), new SimpleStringEncoder<String>("UTF-8"))
                // 设置文件滚动策略(文件切换)
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withRolloverInterval(Duration.ofSeconds(180)) // 设置间隔时长180秒进行文件切换
                                .withInactivityInterval(Duration.ofSeconds(20)) // 文件20秒没有数据写入进行文件切换
                                .withMaxPartSize(MemorySize.ofMebiBytes(1)) // 设置文件大小1MB进行文件切换
                                .build()
                )
                // 分桶策略(划分子文件夹)
                .withBucketAssigner(new DateTimeBucketAssigner<String>()) // 按照yyyy-mm-dd--h进行分桶
                //设置分桶检查时间间隔为100毫秒
                .withBucketCheckInterval(100)
                // 输出文件文件名相关配置
                .withOutputFileConfig(
                        OutputFileConfig.builder()
                                .withPartPrefix("test_") // 文件前缀
                                .withPartSuffix(".txt") // 文件后缀
                                .build()
                )
                .build();
        // 输出到文件
        map.print();
        map.sinkTo(fileSink);
        env.execute();
    }
}

代码内容这里就不详细说明了,注释已经写得很清楚了.有一点要注意使用FileSink的时候我们要加上对应的pom依赖.我这里使用Flink版本是1.15.3

        <!-- File connector -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-avro</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-files</artifactId>
            <version>${flink.version}</version>
        </dependency>

这里我们先看一下生成的结果文件

-rw-r--r--  1 xxx  staff   1.0M  6 27 14:43 .test_-eb905337-488d-46f1-8177-86fbb46f778f-0.txt.inprogress.91e49c89-cc79-44f5-940d-ded2770b61a1
-rw-r--r--  1 xxx  staff   1.0M  6 27 14:44 .test_-eb905337-488d-46f1-8177-86fbb46f778f-1.txt.inprogress.c548bd30-8583-48d5-91d2-2e11a7dca2cd
-rw-r--r--  1 xxx  staff   1.0M  6 27 14:45 .test_-eb905337-488d-46f1-8177-86fbb46f778f-2.txt.inprogress.a041dba1-8f37-4307-82da-682c48b0796b
-rw-r--r--  1 xxx  staff   280K  6 27 14:45 .test_-eb905337-488d-46f1-8177-86fbb46f778f-3.txt.inprogress.e05d1759-0a38-4a25-bcd0-1216ce6dda59

这里有必要说明一下由于我使用的是Mac在生成文件的时候会出现一个小问题,上面的那种文件会隐藏起来,直接点开文件夹是看不到的可以通过command + shift + .来显示隐藏文件,或者像我这种直接通过终端ll -a来查看,windows没有发现这个问题.
可以看到除了最后一个文件,其他的文件大小基本都是1MB,最后一个是因为写入的数据大小还没有满足1MB,并且写入时间也没有满足滚动条件,所以还在持续写入中.
而且通过文件名我们可以看到所有文件中都带有inprogress这个状态,这是因为我们没有开启checkpoint,这里先说一下FileSink写入文件时的三个文件状态,官网原图如下:
forbulkformat,FLink,flink,大数据
这三种状态分别是inprogress、pending和finished,对应的就是处理中、挂起和完成,官网中同时也说明了FileSink必须和checkpoint配合使用,不然文件的状态只会出现inprogress和pending,原文内容如下:

下面我们在看一下加入checkpoint的代码和结果文件
代码如下

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.time.Duration;

/**
 * @Author: J
 * @Version: 1.0
 * @CreateTime: 2023/6/27
 * @Description: 测试
 **/
public class FlinkFileSink {
    public static void main(String[] args) throws Exception {
        // 构建流环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 设置并行度为1
        env.setParallelism(1);
        // 这里是生成数据流,CustomizeSource这个类是自定义数据源(为了方便测试)
        DataStreamSource<CustomizeBean> dataStreamSource = env.addSource(new CustomizeSource());
        // 现将数据转换成字符串形式
        SingleOutputStreamOperator<String> map = dataStreamSource.map(bean -> bean.toString());

        // 每20秒作为checkpoint的一个周期
        env.enableCheckpointing(20000);
        // 两次checkpoint间隔最少是10秒
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10000);
        // 程序取消或者停止时不删除checkpoint
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // checkpoint必须在60秒结束,否则将丢弃
        env.getCheckpointConfig().setCheckpointTimeout(60000);
        // 同一时间只能有一个checkpoint
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        // 设置EXACTLY_ONCE语义,默认就是这个
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // checkpoint存储位置
        env.getCheckpointConfig().setCheckpointStorage("file:///Users/xxx/data/testData/checkpoint");
        // 设置执行模型为Streaming方式
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
        
        // 构造FileSink对象,这里使用的RowFormat,即行处理类型的
        FileSink<String> fileSink = FileSink
                // 配置文件输出路径及编码格式
                .forRowFormat(new Path("/Users/xxx/data/testData/"), new SimpleStringEncoder<String>("UTF-8"))
                // 设置文件滚动策略(文件切换)
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withRolloverInterval(Duration.ofSeconds(180)) // 设置间隔时长180秒进行文件切换
                                .withInactivityInterval(Duration.ofSeconds(20)) // 文件20秒没有数据写入进行文件切换
                                .withMaxPartSize(MemorySize.ofMebiBytes(1)) // 设置文件大小1MB进行文件切换
                                .build()
                )
                // 分桶策略(划分子文件夹)
                .withBucketAssigner(new DateTimeBucketAssigner<String>()) // 按照yyyy-mm-dd--h进行分桶
                //设置分桶检查时间间隔为100毫秒
                .withBucketCheckInterval(100)
                // 输出文件文件名相关配置
                .withOutputFileConfig(
                        OutputFileConfig.builder()
                                .withPartPrefix("test_") // 文件前缀
                                .withPartSuffix(".txt") // 文件后缀
                                .build()
                )
                .build();
        // 输出到文件
        map.print();
        map.sinkTo(fileSink);
        env.execute();
    }
}

看一下结果文件:

-rw-r--r--  1 xxx  staff   761K  6 27 15:13 .test_-96ccd42e-716d-4ee0-835e-342618914e7d-2.txt.inprogress.aa5fccaa-f99f-4059-93e7-6d3c548a66b3
-rw-r--r--  1 xxx  staff   1.0M  6 27 15:11 test_-96ccd42e-716d-4ee0-835e-342618914e7d-0.txt
-rw-r--r--  1 xxx  staff   1.0M  6 27 15:12 test_-96ccd42e-716d-4ee0-835e-342618914e7d-1.txt