Spark Study Notes: Transformations (Part 4)

System Administrator 2023-02-11 11:25

*  Basic transformations

*  Key-value transformations

### Key-Value Transformations ###

*  cogroup\[W\](other: RDD\[(K, W)\]): RDD\[(K, (Iterable\[V\], Iterable\[W\]))\]
*  cogroup\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, (Iterable\[V\], Iterable\[W\]))\]
*  cogroup\[W\](other: RDD\[(K, W)\], partitioner: Partitioner): RDD\[(K, (Iterable\[V\], Iterable\[W\]))\]
*  cogroup\[W1, W2\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\]): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\]))\]
*  cogroup\[W1, W2\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\], numPartitions: Int): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\]))\]
*  cogroup\[W1, W2\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\], partitioner: Partitioner): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\]))\]
*  cogroup\[W1, W2, W3\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\], other3: RDD\[(K, W3)\]): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\], Iterable\[W3\]))\]
*  cogroup\[W1, W2, W3\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\], other3: RDD\[(K, W3)\], numPartitions: Int): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\], Iterable\[W3\]))\]
*  cogroup\[W1, W2, W3\](other1: RDD\[(K, W1)\], other2: RDD\[(K, W2)\], other3: RDD\[(K, W3)\], partitioner: Partitioner): RDD\[(K, (Iterable\[V\], Iterable\[W1\], Iterable\[W2\], Iterable\[W3\]))\]

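cogroup is similar to a full outer join in SQL: the result contains every key that appears in any of the RDDs involved, paired with the collection of values for that key from each RDD. Where a key has no match in one of the RDDs, that RDD's collection is empty.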

    scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
    rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24
    
    scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
    rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24
    
    scala> var rdd3 = sc.makeRDD(Array(("A", "A"), ("E", "E")), 2)
    rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24
    
    scala> rdd1.cogroup(rdd2).collect
    res26: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d))), (A,(CompactBuffer(1),CompactBuffer(a))), (C,(CompactBuffer(3),CompactBuffer(c))))
    
    scala> rdd1.cogroup(rdd2, rdd3).collect
    res27: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())), (A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))), (C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())), (E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
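
The comparison with a full outer join is only approximate, and the difference shows up once a key occurs more than once. The following is a minimal sketch rather than part of the original session: `left` and `right` are hypothetical sample RDDs, and the code assumes it runs in spark-shell or in a script with Spark on the classpath (`SparkContext.getOrCreate` reuses an existing context if there is one).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if present; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cogroup-duplicate-keys"))

// Hypothetical sample data: key "A" occurs twice on each side.
val left  = sc.makeRDD(Seq(("A", "1"), ("A", "2"), ("B", "3")), 2)
val right = sc.makeRDD(Seq(("A", "a"), ("A", "b"), ("C", "c")), 2)

// cogroup yields exactly one record per distinct key. The Iterables are
// converted to sorted Lists so the result does not depend on partition order.
val grouped: Map[String, (List[String], List[String])] =
  left.cogroup(right)
    .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
    .collect()
    .toMap

grouped.toSeq.sortBy(_._1).foreach(println)
// (A,(List(1, 2),List(a, b)))
// (B,(List(3),List()))
// (C,(List(),List(c)))

// A full outer join on the same data expands key "A" into 2 x 2 records,
// so it returns 6 records where cogroup returns 3.
println(left.fullOuterJoin(right).count())   // 6
```

In other words, cogroup keeps the values of each key grouped per RDD, while the join operations flatten those groups into one record per matching pair.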

*  join\[W\](other: RDD\[(K, W)\]): RDD\[(K, (V, W))\]
*  join\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, (V, W))\]
*  join\[W\](other: RDD\[(K, W)\], partitioner: Partitioner): RDD\[(K, (V, W))\]
*  fullOuterJoin\[W\](other: RDD\[(K, W)\]): RDD\[(K, (Option\[V\], Option\[W\]))\]
*  fullOuterJoin\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, (Option\[V\], Option\[W\]))\]
*  fullOuterJoin\[W\](other: RDD\[(K, W)\], partitioner: Partitioner): RDD\[(K, (Option\[V\], Option\[W\]))\]
*  leftOuterJoin\[W\](other: RDD\[(K, W)\]): RDD\[(K, (V, Option\[W\]))\]
*  leftOuterJoin\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, (V, Option\[W\]))\]
*  leftOuterJoin\[W\](other: RDD\[(K, W)\], partitioner: Partitioner): RDD\[(K, (V, Option\[W\]))\]
*  rightOuterJoin\[W\](other: RDD\[(K, W)\]): RDD\[(K, (Option\[V\], W))\]
*  rightOuterJoin\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, (Option\[V\], W))\]
*  rightOuterJoin\[W\](other: RDD\[(K, W)\], partitioner: Partitioner): RDD\[(K, (Option\[V\], W))\]

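join, fullOuterJoin, leftOuterJoin, and rightOuterJoin combine the records of two RDD\[(K, V)\]s whose keys are equal. They correspond to SQL's inner join, full outer join, left outer join, and right outer join respectively. Internally, all four are implemented on top of cogroup. In the outer joins, the side that may be missing is wrapped in an Option and is None when the key has no match.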

    scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
    rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24
    
    scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
    rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24
    
    scala> rdd1.join(rdd2).collect
    res28: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
    
    scala> rdd1.leftOuterJoin(rdd2).collect
    res29: Array[(String, (String, Option[String]))] = Array((B,(2,None)), (A,(1,Some(a))), (C,(3,Some(c))))
    
    scala> rdd1.rightOuterJoin(rdd2).collect
    res30: Array[(String, (Option[String], String))] = Array((D,(None,d)), (A,(Some(1),a)), (C,(Some(3),c)))
    
    scala> rdd1.fullOuterJoin(rdd2)
    res31: org.apache.spark.rdd.RDD[(String, (Option[String], Option[String]))] = MapPartitionsRDD[46] at fullOuterJoin at <console>:28
    
    scala> rdd1.fullOuterJoin(rdd2).collect
    res32: Array[(String, (Option[String], Option[String]))] = Array((B,(Some(2),None)), (D,(None,Some(d))), (A,(Some(1),Some(a))), (C,(Some(3),Some(c))))
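
To make the relationship with cogroup concrete, here is a minimal sketch of how an inner join and a left outer join can be derived from the cogroup result. It illustrates the idea under stated assumptions and is not a copy of Spark's source; the names `cogrouped`, `innerViaCogroup`, and `leftViaCogroup` are introduced here for illustration only, and the data matches the session above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuse the shell's context if present; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("join-via-cogroup"))

val rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
val rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)

// One record per key, with the values from each side grouped separately.
val cogrouped = rdd1.cogroup(rdd2)

// Inner join: pair every left value with every right value.
// A key whose left or right group is empty produces no output.
val innerViaCogroup: RDD[(String, (String, String))] =
  cogrouped.flatMapValues { case (vs, ws) =>
    for (v <- vs; w <- ws) yield (v, w)
  }

// Left outer join: keep every left value, using None when the right group is empty.
val leftViaCogroup: RDD[(String, (String, Option[String]))] =
  cogrouped.flatMapValues { case (vs, ws) =>
    if (ws.isEmpty) vs.map(v => (v, Option.empty[String]))
    else for (v <- vs; w <- ws) yield (v, Option(w))
  }

// Both should agree with the built-in operations (compared as sets,
// because the order of collected records is not guaranteed).
println(innerViaCogroup.collect().toSet == rdd1.join(rdd2).collect().toSet)
println(leftViaCogroup.collect().toSet == rdd1.leftOuterJoin(rdd2).collect().toSet)
```

The right and full outer joins follow the same pattern, with the Option wrapping applied to the left side or to both sides.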

*  subtractByKey\[W\](other: RDD\[(K, W)\]): RDD\[(K, V)\]
*  subtractByKey\[W\](other: RDD\[(K, W)\], p: Partitioner): RDD\[(K, V)\]
*  subtractByKey\[W\](other: RDD\[(K, W)\], numPartitions: Int): RDD\[(K, V)\]

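subtractByKey is similar to subtract, except that it operates on key-value RDDs (RDD\[(K, V)\]) and compares keys only: it returns the pairs of this RDD whose key does not appear in other.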
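
The original notes give no session for subtractByKey, so the following is a minimal sketch that reuses the `rdd1` and `rdd2` data from the earlier examples. The names `onlyInRdd1` and `viaSubtract` are introduced here for illustration, and the code assumes spark-shell or a script with Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if present; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("subtract-by-key"))

val rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
val rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)

// Only keys are compared: "A" and "C" also occur in rdd2, so just "B" survives.
val onlyInRdd1 = rdd1.subtractByKey(rdd2).collect()
println(onlyInRdd1.mkString(", "))   // (B,2)

// subtract compares whole elements. No (key, value) pair of rdd1 occurs in
// rdd2, so all three pairs are kept (in no guaranteed order).
val viaSubtract = rdd1.subtract(rdd2).collect()
println(viaSubtract.length)          // 3
```

Because only keys matter, the value type W of the other RDD does not have to match V.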

References:

\[1\] Guo Jingzhan (郭景瞻). 图解Spark:核心技术与案例实战 \[M\]. Beijing: Publishing House of Electronics Industry (电子工业出版社), 2017.