Spark-用IDEA编写wordcount demo

这篇具有很好参考价值的文章主要介绍了Spark-用IDEA编写wordcount demo。希望对大家有所帮助。如果存在错误或未考虑完全的地方,请大家不吝赐教,您也可以点击"举报违法"按钮提交疑问。

配置

Spark版本:3.2.0

Scala版本:2.12.12

JDK:1.8

Maven:3.6.3

Spark-用IDEA编写wordcount demo,Spark,spark,大数据,分布式

pom文件1

Spark版本:3.2.0

Scala版本:2.12.12

JDK:1.8

Maven:3.6.3


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zzjz.Spark</groupId>
    <artifactId>Spark</artifactId>
    <version>1.0</version>

    <properties>
        <spark.version>3.2.0</spark.version>
        <scala.version>2.12</scala.version>
    </properties>


    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>

            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>

        </plugins>
    </build>






</project>

POM文件2

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>sparklx</artifactId>
  <version>1.0</version>

  <name>sparklx</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.12.12</scala.version> <!-- 将 Scala 版本更新为 2.12.12 -->
    <spark.version>2.4.3</spark.version> <!-- 保持原始的 Spark 版本 -->
  </properties>



  <dependencies>
    <!-- ... 其他依赖 ... -->
    <!-- 导入 Scala 和 Spark 的依赖 -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>





  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
        <!-- Scala compiler plugin -->
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>4.4.0</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>

      </plugins>
    </pluginManagement>
  </build>


</project>

样例数据

9422850591,11603,39939,山西,邮件,人员
9422850591,116427,39911,山西,邮件,人员
9422850591,116437,39895,山西,邮件,人员

代码

import org.apache.spark.{SparkConf, SparkContext}  // 导入SparkConf和SparkContext类
object wcPerson {
  def main (args:Array[String]): Unit ={
    // 创建SparkConf对象,设置应用程序名称为"wcPerson",运行模式为本地模式,使用一个CPU核心
    val conf = new SparkConf().setAppName("wcPerson").setMaster("local[1]")  
    // 创建SparkContext对象,与Spark集群进行通信
    val sc = new SparkContext(conf) 
    // 加载文件,将每一行作为一个字符串元素,返回一个RDD 
    val inputFile = sc.textFile("D:\\workspace\\spark\\src\\main\\Data\\person")  
    // 对RDD应用flatMap转换操作,将每一行按","分割成多个单词,并将所有单词扁平化为一个RDD
    val wc = inputFile.flatMap(line => line.split(","))  
    // 对RDD应用map转换操作,将每个单词映射为(key, value)的元组,其中key为单词本身,value为1
      .map(word => (word,1)) 
    // 对相同key的元组进行聚合操作,将相同key的value相加 
      .reduceByKey((a,b) => a + b) 
    // 打印输出聚合结果 
    wc.foreach(println)  
  }
}

运行结果

Spark-用IDEA编写wordcount demo,Spark,spark,大数据,分布式文章来源地址https://www.toymoban.com/news/detail-550437.html

D:\Java\jdk1.8.0_131\bin\java.exe "-javaagent:D:\idea\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=52283:D:\idea\IntelliJ IDEA 2021.1.3\bin" -Dfile.encoding=UTF-8 -classpath "D:\idea\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar" com.intellij.rt.execution.CommandLineWrapper C:\Users\Administrator\AppData\Local\Temp\idea_classpath1156784809 wcPerson
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/D:/spark/spark-3.2.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/Maven/Maven_repositories/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/11 10:00:01 INFO SparkContext: Running Spark version 3.2.0
23/07/11 10:00:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/07/11 10:00:02 INFO ResourceUtils: ==============================================================
23/07/11 10:00:02 INFO ResourceUtils: No custom resources configured for spark.driver.
23/07/11 10:00:02 INFO ResourceUtils: ==============================================================
23/07/11 10:00:02 INFO SparkContext: Submitted application: wcPerson
23/07/11 10:00:02 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/07/11 10:00:02 INFO ResourceProfile: Limiting resource is cpu
23/07/11 10:00:02 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/07/11 10:00:02 INFO SecurityManager: Changing view acls to: Administrator
23/07/11 10:00:02 INFO SecurityManager: Changing modify acls to: Administrator
23/07/11 10:00:02 INFO SecurityManager: Changing view acls groups to: 
23/07/11 10:00:02 INFO SecurityManager: Changing modify acls groups to: 
23/07/11 10:00:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(Administrator); groups with view permissions: Set(); users  with modify permissions: Set(Administrator); groups with modify permissions: Set()
23/07/11 10:00:07 INFO Utils: Successfully started service 'sparkDriver' on port 52323.
23/07/11 10:00:07 INFO SparkEnv: Registering MapOutputTracker
23/07/11 10:00:07 INFO SparkEnv: Registering BlockManagerMaster
23/07/11 10:00:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/07/11 10:00:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/07/11 10:00:07 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/07/11 10:00:07 INFO DiskBlockManager: Created local directory at C:\Users\Administrator\AppData\Local\Temp\blockmgr-0052575b-7c9f-457e-9ed5-fb50af59f965
23/07/11 10:00:07 INFO MemoryStore: MemoryStore started with capacity 623.4 MiB
23/07/11 10:00:07 INFO SparkEnv: Registering OutputCommitCoordinator
23/07/11 10:00:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/07/11 10:00:08 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://zzjz:4040
23/07/11 10:00:08 INFO Executor: Starting executor ID driver on host zzjz
23/07/11 10:00:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52339.
23/07/11 10:00:08 INFO NettyBlockTransferService: Server created on zzjz:52339
23/07/11 10:00:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/07/11 10:00:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, zzjz, 52339, None)
23/07/11 10:00:08 INFO BlockManagerMasterEndpoint: Registering block manager zzjz:52339 with 623.4 MiB RAM, BlockManagerId(driver, zzjz, 52339, None)
23/07/11 10:00:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, zzjz, 52339, None)
23/07/11 10:00:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, zzjz, 52339, None)
23/07/11 10:00:10 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 244.0 KiB, free 623.2 MiB)
23/07/11 10:00:10 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KiB, free 623.1 MiB)
23/07/11 10:00:10 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on zzjz:52339 (size: 23.4 KiB, free: 623.4 MiB)
23/07/11 10:00:10 INFO SparkContext: Created broadcast 0 from textFile at wcPerson.scala:7
23/07/11 10:00:10 INFO FileInputFormat: Total input paths to process : 1
23/07/11 10:00:10 INFO SparkContext: Starting job: foreach at wcPerson.scala:10
23/07/11 10:00:11 INFO DAGScheduler: Registering RDD 3 (map at wcPerson.scala:9) as input to shuffle 0
23/07/11 10:00:11 INFO DAGScheduler: Got job 0 (foreach at wcPerson.scala:10) with 1 output partitions
23/07/11 10:00:11 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at wcPerson.scala:10)
23/07/11 10:00:11 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
23/07/11 10:00:11 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
23/07/11 10:00:11 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at wcPerson.scala:9), which has no missing parents
23/07/11 10:00:11 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.9 KiB, free 623.1 MiB)
23/07/11 10:00:11 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.0 KiB, free 623.1 MiB)
23/07/11 10:00:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on zzjz:52339 (size: 4.0 KiB, free: 623.4 MiB)
23/07/11 10:00:11 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1427
23/07/11 10:00:11 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at wcPerson.scala:9) (first 15 tasks are for partitions Vector(0))
23/07/11 10:00:11 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
23/07/11 10:00:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (zzjz, executor driver, partition 0, PROCESS_LOCAL, 4503 bytes) taskResourceAssignments Map()
23/07/11 10:00:11 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
23/07/11 10:00:12 INFO HadoopRDD: Input split: file:/D:/workspace/spark/src/main/Data/person:0+135
23/07/11 10:00:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1325 bytes result sent to driver
23/07/11 10:00:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1022 ms on zzjz (executor driver) (1/1)
23/07/11 10:00:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
23/07/11 10:00:12 INFO DAGScheduler: ShuffleMapStage 0 (map at wcPerson.scala:9) finished in 1.263 s
23/07/11 10:00:12 INFO DAGScheduler: looking for newly runnable stages
23/07/11 10:00:12 INFO DAGScheduler: running: Set()
23/07/11 10:00:12 INFO DAGScheduler: waiting: Set(ResultStage 1)
23/07/11 10:00:12 INFO DAGScheduler: failed: Set()
23/07/11 10:00:12 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at wcPerson.scala:9), which has no missing parents
23/07/11 10:00:12 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.3 KiB, free 623.1 MiB)
23/07/11 10:00:12 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.1 KiB, free 623.1 MiB)
23/07/11 10:00:12 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on zzjz:52339 (size: 3.1 KiB, free: 623.4 MiB)
23/07/11 10:00:12 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1427
23/07/11 10:00:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at wcPerson.scala:9) (first 15 tasks are for partitions Vector(0))
23/07/11 10:00:12 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
23/07/11 10:00:12 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (zzjz, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map()
23/07/11 10:00:12 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
23/07/11 10:00:12 INFO ShuffleBlockFetcherIterator: Getting 1 (142.0 B) non-empty blocks including 1 (142.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/07/11 10:00:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 20 ms
(116437,1)
(39911,1)
(116427,1)
(9422850591,3)
(39895,1)
(山西,3)
(11603,1)
(39939,1)
(人员,3)
(邮件,3)
23/07/11 10:00:12 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1224 bytes result sent to driver
23/07/11 10:00:12 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 123 ms on zzjz (executor driver) (1/1)
23/07/11 10:00:12 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
23/07/11 10:00:12 INFO DAGScheduler: ResultStage 1 (foreach at wcPerson.scala:10) finished in 0.153 s
23/07/11 10:00:12 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
23/07/11 10:00:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
23/07/11 10:00:12 INFO DAGScheduler: Job 0 finished: foreach at wcPerson.scala:10, took 2.044775 s
23/07/11 10:00:12 INFO SparkContext: Invoking stop() from shutdown hook
23/07/11 10:00:12 INFO SparkUI: Stopped Spark web UI at http://zzjz:4040
23/07/11 10:00:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/07/11 10:00:12 INFO MemoryStore: MemoryStore cleared
23/07/11 10:00:12 INFO BlockManager: BlockManager stopped
23/07/11 10:00:12 INFO BlockManagerMaster: BlockManagerMaster stopped
23/07/11 10:00:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/07/11 10:00:12 INFO SparkContext: Successfully stopped SparkContext
23/07/11 10:00:12 INFO ShutdownHookManager: Shutdown hook called
23/07/11 10:00:12 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator\AppData\Local\Temp\spark-9f3eb32d-30f7-44d6-8751-f668b2710d89

Process finished with exit code 0

到了这里,关于Spark-用IDEA编写wordcount demo的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处: 如若内容造成侵权/违法违规/事实不符,请点击违法举报进行投诉反馈,一经查实,立即删除!

领支付宝红包 赞助服务器费用

相关文章

  • 5 | Java Spark WordCount打成Jar 包测试

    步骤 1:准备 WordCount 代码 首先,确保 编写了 WordCount 代码,已经提供了正确的输入文件路径。

    2024年02月10日
    浏览(33)
  • Spark大数据分析与实战笔记(第三章 Spark RDD 弹性分布式数据集-02)

    人生很长,不必慌张。你未长大,我要担当。 传统的MapReduce虽然具有自动容错、平衡负载和可拓展性的优点,但是其最大缺点是采用非循环式的数据流模型,使得在迭代计算式要进行大量的磁盘IO操作。Spark中的RDD可以很好的解决这一缺点。 RDD是Spark提供的最重要的抽象概念

    2024年02月22日
    浏览(96)
  • Spark弹性分布式数据集

    1. Spark RDD是什么 RDD(Resilient Distributed Dataset,弹性分布式数据集)是一个不可变的分布式对象集合,是Spark中最基本的数据抽象。在代码中RDD是一个抽象类,代表一个弹性的、不可变、可分区、里面的元素可并行计算的集合。 每个RDD都被分为多个分区,这些分区运行在集群中

    2024年02月13日
    浏览(60)
  • spark stream入门案例:netcat准实时处理wordCount(scala 编程)

    目录 案例需求 代码 结果 解析          案例需求:         使用netcat工具向9999端口不断的发送数据,通过SparkStreaming读取端口数据并统计不同单词出现的次数         -- 1. Spark从socket中获取数据:一行一行的获取         -- 2. Driver程序执行时,streaming处理过程

    2024年02月07日
    浏览(41)
  • Spark—通过Java、Scala API实现WordCount案例的基本操作

    实验原理 Spark的核心就是RDD,所有在RDD上的操作会被运行在Cluster上,Driver程序启动很多Workers,Workers在(分布式)文件系统中读取数据后转化为RDD(弹性分布式数据集),然后对RDD在内存中进行缓存和计算。 而RDD有两种类型的操作 ,分别是Action(返回values)和Transformations(

    2024年02月15日
    浏览(44)
  • 大数据学习06-Spark分布式集群部署

    配置好IP vim /etc/sysconfig/network-scripts/ifcfg-ens33 修改主机名 vi /etc/hostname 做好IP映射 vim /etc/hosts 关闭防火墙 systemctl status firewalld systemctl stop firewalld systemctl disable firewalld 配置SSH免密登录 ssh-keygen -t rsa 下载Scala安装包 配置环境变量 添加如下配置 使环境生效 验证 Spark官网 解压 上

    2024年02月10日
    浏览(70)
  • 大数据开发之Spark(RDD弹性分布式数据集)

    rdd(resilient distributed dataset)叫做弹性分布式数据集,是spark中最基本的数据抽象。 代码中是一个抽象类,它代表一个弹性的、不可变、可分区、里面的元素可并行计算的集合。 1.1.1 rdd类比工厂生产 1.1.2 wordcount工作流程 1、一组分区(partition),即是数据集的基本组成单位,

    2024年01月24日
    浏览(71)
  • 大数据课程K2——Spark的RDD弹性分布式数据集

    文章作者邮箱:yugongshiye@sina.cn              地址:广东惠州 ⚪ 了解Spark的RDD结构; ⚪ 掌握Spark的RDD操作方法; ⚪ 掌握Spark的RDD常用变换方法、常用执行方法; 初学Spark时,把RDD看做是一个集合类型(类似于Array或List),用于存储数据和操作数据,但RDD和普通集合的区别

    2024年02月12日
    浏览(55)
  • 大数据开源框架环境搭建(七)——Spark完全分布式集群的安装部署

    前言:七八九用于Spark的编程实验 大数据开源框架之基于Spark的气象数据处理与分析_木子一个Lee的博客-CSDN博客_spark舆情分析 目录 实验环境: 实验步骤: 一、解压 二、配置环境变量:  三、修改配置文件  1.修改spark-env.sh配置文件: 2.修改配置文件slaves: 3.分发配置文件:

    2024年02月11日
    浏览(54)
  • 云计算与大数据第16章 分布式内存计算平台Spark习题

    1、Spark是Hadoop生态(  B  )组件的替代方案。 A. Hadoop     B. MapReduce        C. Yarn             D.HDFS 2、以下(  D  )不是Spark的主要组件。 A. Driver      B. SparkContext       C. ClusterManager D. ResourceManager 3、Spark中的Executor是(  A  )。 A.执行器      B.主节

    2024年02月14日
    浏览(119)

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

博客赞助

微信扫一扫打赏

请作者喝杯咖啡吧~博客赞助

支付宝扫一扫领取红包,优惠每天领

二维码1

领取红包

二维码2

领红包