spark/Partitioning.md

in python :
`RDD.getNumPartitions()`

- to see elements in each partition (using python)
`RDD.glom().collect()`

- By default, Spark will decide the parallelism, but to specify a custom level of parallelism, a 2nd parameter can be specified
while performing a transformation/aggregate operation as -
More optimized version of `repartition()` is `coalesce()`
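`repartition()` always triggers a full shuffle, whereas `coalesce()` can merge existing partitions down without one. A toy pure-Python model of that merge (this is an illustration of the idea, not Spark code; real `coalesce()` groups partitions by locality):

```python
def coalesce(partitions, n):
    # merge a list of partitions down to n by concatenating them,
    # without shuffling individual elements between machines
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))  # → [[1, 2, 4, 5], [3, 6]]
```

Since whole partitions are moved rather than individual records, this is cheaper than a full shuffle.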

- Partitioning is useful only when a dataset is reused multiple times in key-oriented operations such as `join`

- Alternatively, a custom partitioner for **paired RDDs** can be defined using the partitionBy() method, passing a new **HashPartitioner(Int)** or **RangePartitioner** to it.

in Scala:
```
import org.apache.spark.HashPartitioner

val map1 = sc.textFile("moby.txt").flatMap(x => x.split(" ")).map(x => (x, 1))
val PartMap = map1.partitionBy(new HashPartitioner(100)).persist()
```

in Python :

`PartMap = map1.partitionBy(100).persist()`

_since partitionBy is a transformation operation, use persist() after partitionBy; otherwise, each time the RDD is referenced it will get partitioned repeatedly, negating the effect of partitionBy._
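Under the hood, a hash partitioner simply maps each key to a partition index by hashing it modulo the number of partitions. A minimal pure-Python sketch of that idea (this models what Spark's HashPartitioner does; it is not the PySpark API):

```python
def hash_partition(key, num_partitions):
    # model of Spark's HashPartitioner: nonNegativeMod(key.hashCode, numPartitions);
    # Python's % already yields a non-negative result for a positive modulus
    return hash(key) % num_partitions

# the same key always lands in the same partition, which is what makes
# partition-aware operations like join cheap: matching keys are co-located
p = hash_partition("spark", 100)
assert p == hash_partition("spark", 100)
assert 0 <= p < 100
```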

For the following operations, a partitioner gets set on the output RDD automatically -
- join
- cogroup
- groupWith()
- leftOuterJoin()
- rightOuterJoin()
- flatMapValues() - if parent RDD has partitioner
- filter() - if parent RDD has partitioner
- groupByKey()
- sort()
- reduceByKey()
- combineByKey()
- mapValues() - if parent RDD has partitioner
- partitionBy()

- To maximize the potential for partitioning-related optimizations, always use mapValues() or flatMapValues() whenever
there is no change in element keys

- To further customize partitioning, a function can be defined, e.g. to partition by URL domain -

```
from urllib.parse import urlparse

def hash_part(url):
    # partitionBy's partition function must return an int, so hash the domain
    return hash(urlparse(url).netloc)

RDD.partitionBy(20, hash_part)
```
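Because `hash_part` keys on the domain only, every URL from the same site maps to the same partition. A quick plain-Python check of that property (no Spark needed; the sample URLs are made up):

```python
from urllib.parse import urlparse

def hash_part(url):
    # same idea as above: hash only the domain part of the URL
    return hash(urlparse(url).netloc)

# two pages on the same domain get the same partition index mod 20,
# so partitionBy(20, hash_part) co-locates them
a = hash_part("http://example.com/page1") % 20
b = hash_part("http://example.com/page2") % 20
assert a == b
```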