
[Pyspark] groupby, collect_set: collecting a column's values into a list per group

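from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

# Start a local Spark context.
sc = SparkContext("local")

# HiveContext is deprecated since Spark 2.0 in favor of SparkSession.
sqlContext = HiveContext(sc)

# Sample data: one group ("a") with nulls in both value columns.
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

df.show()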

+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a| null| null|
|  a|code1| null|
|  a|code2|name2|
+---+-----+-----+

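# collect_set gathers the distinct values of each group into an array;
# collect_list keeps duplicates. Both skip nulls, as the output shows.
(df
  .groupby("id")
  .agg(F.collect_set("code"),
       F.collect_list("name"))
  .show())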

+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
|  a|   [code1, code2]|           [name2]|
+---+-----------------+------------------+
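
To make the result above easier to read, here is a minimal plain-Python sketch of what the two aggregations appear to do with this data: values are grouped by id, nulls are skipped, and the "set" variant also drops duplicates. The helper collect_by_group and its parameters are hypothetical names used only for illustration, not part of PySpark. One assumption to note: this sketch keeps values in first-seen order, whereas Spark does not guarantee the order of elements returned by collect_set or collect_list.

```python
from collections import defaultdict

# Same rows as the DataFrame above: (id, code, name).
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]


def collect_by_group(rows, key_idx, value_idx, distinct=False):
    """Hypothetical helper emulating collect_list (distinct=False)
    or collect_set (distinct=True) on plain tuples."""
    groups = defaultdict(list)
    for row in rows:
        key, value = row[key_idx], row[value_idx]
        groups[key]  # create the group even if all of its values are null
        if value is None:
            continue  # nulls are skipped, as in the Spark output
        if distinct and value in groups[key]:
            continue  # the "set" variant drops duplicates
        groups[key].append(value)
    return dict(groups)


codes = collect_by_group(rows, 0, 1, distinct=True)
names = collect_by_group(rows, 0, 2)
print(codes)  # {'a': ['code1', 'code2']}
print(names)  # {'a': ['name2']}
```

A group whose values are all null ends up with an empty list in this sketch rather than disappearing from the result.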
  • References
    • https://stackoverflow.com/questions/37580782/pyspark-collect-set-or-collect-list-with-groupby
This post is licensed under CC BY 4.0 by the author.
