You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardexpand all lines: spark/Partitioning.md
+41-3
Original file line number
Diff line number
Diff line change
@@ -6,6 +6,8 @@
6
6
in python :
7
7
`RDD.getNumPartitions()`
8
8
9
+
- to see elements in each partition (using python)
10
+
` RDD.glom().collect()`
9
11
10
12
-By default , Spark will decide parallelism but to specify custom level of parallelism, a 2nd parameter can be specified
11
13
while performing transformation/aggregate operation as -
@@ -20,15 +22,51 @@ More optimized version of `repartition()` is `coalesce()`
20
22
21
23
- Paritioning is useful only when dataset is reused multiple times in key oriented operations such as `join`
22
24
23
-
- Another way custom paritioner can be defined using partitionBy() method and passing a new HashPartitioner(Int) to that method.
25
+
- Another way custom paritioner for **paired RDDS** can be defined using partitionBy() method and passing a new **HashPartitioner(Int)/RangePartitioner** to that method.
26
+
27
+
in Scala:
24
28
```
25
29
import org.apache.spark.HashPartitioner
26
30
27
31
val map1 = sc.textFile("moby.txt").flatMap( x => x.split(" ")).map(x=>(x,1))
28
-
val PartMap = map1.partitionBy(new HashPartitioner(100))
29
-
32
+
val PartMap = map1.partitionBy(new HashPartitioner(100)).persist()
30
33
```
31
34
35
+
in Python :
36
+
37
+
` val PartMap = map1.partitionBy(100).persist()`
38
+
39
+
_since paritionBy is transformation operation, user persist() after partitionBy, otherwise each time_
40
+
41
+
_rdd is referenced, it will get partitioned repeatedly. This will negate the effect of partitionBy_
42
+
43
+
44
+
for following operations results in partitioner gets set on output RDD automatically -
45
+
- join
46
+
- cogroup
47
+
- groupWith()
48
+
- leftOuterJoin()
49
+
- rightOuterJoin()
50
+
- flatMapValues() - if parent RDD has partitioner
51
+
- filter() - if parent RDD has partitioner
52
+
- groupByKey()
53
+
- sort()
54
+
- reduceByKey()
55
+
- combineByKey()
56
+
- mapValues() - if parent RDD has partitioner
57
+
- partitionBy()
58
+
59
+
- To maximize the potential for partiioning related optimizations, always use mapValues() or flatMapValues() whenever
60
+
there is no change in elements keys
61
+
62
+
- To further customize partition, function can be defined , for e.g. to parition
0 commit comments