In Spark, joining two DataFrames can produce duplicate columns in the result. For example:
Dataset<Row> moviesWithRating = moviesDF
.join(averageRatingMoviesDF,
moviesDF.col("movieId").equalTo(averageRatingMoviesDF.col("movieId")));
Its schema looks like this:
//moviesWithRating.printSchema();
/**
* root
* |-- _id: struct (nullable = true)
* | |-- oid: string (nullable = true)
* |-- actors: string (nullable = true)
* |-- description: string (nullable = true)
* |-- directors: string (nullable = true)
* |-- genres: string (nullable = true)
* |-- issue: string (nullable = true)
* |-- language: string (nullable = true)
* |-- movieId: integer (nullable = true)
* |-- shoot: string (nullable = true)
* |-- timeLong: string (nullable = true)
* |-- title: string (nullable = true)
* |-- movieId: integer (nullable = true)
* |-- avgRating: double (nullable = true)
*/
If we keep operating on this DataFrame, a later reference to the duplicated column may fail with: org.apache.spark.sql.AnalysisException: Reference 'movieId' is ambiguous
There are two ways to remove the duplicate column:
- Method 1: pass the join columns to the join expression as a sequence of column names
// Convert a Java List to a Scala Seq (JavaConversions is deprecated in newer
// Scala versions; scala.collection.JavaConverters serves the same purpose there)
Seq<String> joinColumns = JavaConversions.asScalaBuffer(Arrays.asList("movieId")).toList();
Dataset<Row> moviesWithRating = moviesDF.join(
averageRatingMoviesDF,
joinColumns,
"inner");
Here moviesDF and averageRatingMoviesDF are joined on their respective movieId columns, and the returned result deduplicates them, keeping a single movieId column.
In Scala, the equivalent is:
val moviesWithRating = moviesDf.join(averageRatingMoviesDF, Seq("movieId"))
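To make the behavior of Method 1 concrete without a running Spark cluster, here is a minimal plain-JDK sketch that mimics what a name-based inner join does: rows from both sides are matched on movieId, and the join key appears exactly once in each output row. The class name, sample titles, and ratings are made up for illustration; this is not Spark's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinDedup {
    // Inner-join two keyed datasets on movieId, emitting the key only once per row,
    // analogous to moviesDF.join(averageRatingMoviesDF, Seq("movieId"), "inner").
    public static List<Map<String, Object>> innerJoinOnMovieId(
            Map<Integer, String> movieTitles, Map<Integer, Double> avgRatings) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map.Entry<Integer, String> movie : movieTitles.entrySet()) {
            Double rating = avgRatings.get(movie.getKey());
            if (rating != null) {                       // inner join: keep matching keys only
                Map<String, Object> row = new LinkedHashMap<>();
                row.put("movieId", movie.getKey());     // join key emitted exactly once
                row.put("title", movie.getValue());
                row.put("avgRating", rating);
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> movies = Map.of(1, "Alien", 2, "Heat");
        Map<Integer, Double> ratings = Map.of(1, 8.5);  // movieId 2 has no rating
        System.out.println(innerJoinOnMovieId(movies, ratings));
    }
}
```

The contrast with the original expression-based join is that the column-name form tells Spark the two movieId columns are the same logical column, so only one survives, whereas equalTo keeps both.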
- Method 2: use select to return only the desired columns
// requires: import static org.apache.spark.sql.functions.col;
Dataset<Row> moviesWithRating = moviesDF
.join(averageRatingMoviesDF,
moviesDF.col("movieId").equalTo(averageRatingMoviesDF.col("movieId")))
.select(
moviesDF.col("movieId"),
col("actors"),
col("description"),
col("directors"),
col("genres"),
col("issue"),
col("language"),
col("shoot"),
col("timeLong"),
col("title"),
col("avgRating")
);
Note:
If there are only a few columns, Method 2 is recommended.
If there are many columns, Method 1 is recommended.