Save sorted file
Saving files to disk sorted by some custom logic can be quite challenging when you use Spark with Python.
Dataset (userId,itemId,ts): [(3458, 12, 89614), (7186, 1228, 94614), (1399, 367, 99614), (7186, 633, 100614), (3942, 446, 101614), (3458, 402, 103614), (7186, 19, 112614), (7186, 744, 113614), (3104, 890, 116614), (3458, 366, 69403), (5182, 429, 82403), (2460, 602, 87403), (2460, 1104, 88403), (3458, 609, 95403), (2460, 246, 107403), ...]
The requested task is: "you have to split this dataset with these constraints":
- All events of each user must be in the same file
- The events of each user must be sorted by ts
If you are using Spark in a local[n] configuration, you just have to make sure that no two workers are working on the same user (otherwise you will end up with files whose lines are randomly shuffled).
Something like (I really don't like it):
def saveOnFile(x, path='/tmp/'):
    # Append one event to the file named after its userId.
    with open(path + str(x[0]), 'a') as f:
        f.write(str(x[0]) + "\t" + str(x[1]) + "\n")

myRdd.foreach(lambda x: saveOnFile(x))
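One way to take care of that is to key the RDD by userId and repartition it before writing, so that all events of a user land in the same partition and are written by exactly one task. A minimal sketch, assuming the (userId, itemId, ts) tuples above; the output directory and the partition count are only illustrative:

import os

def write_user_events(records, out_dir='/tmp/by_user'):
    # records is an iterator over (userId, event) pairs for this partition only;
    # out_dir is a hypothetical, pre-existing local directory.
    for user_id, event in records:
        with open(os.path.join(out_dir, str(user_id)), 'a') as f:
            f.write('\t'.join(str(v) for v in event) + '\n')

myRdd.map(lambda x: (x[0], x)) \
    .partitionBy(5) \
    .foreachPartition(write_user_events)

Several users may share a partition, but that is fine: each user's file is still appended to by a single task.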
If you are using Spark in cluster mode, you just have to create a pair RDD keyed by the same value you would like to use for the file splitting, and then use:
myRdd.map(lambda x: (x[0], x)) \
    .repartitionAndSortWithinPartitions(5) \
    .map(lambda x: x[1]) \
    .saveAsTextFile('s3n://myBucket/myKey')
'repartitionAndSortWithinPartitions(k)' creates k partitions and then sorts the records within each partition by key, in ascending order by default (you can supply your own key function and partition function).
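If you also want the events of each user sorted by ts, one option is to use a composite (userId, ts) key together with a partition function that looks only at the userId part, so that all events of a user end up in the same partition and are sorted by timestamp within it. A minimal sketch, assuming the (userId, itemId, ts) tuples above; the bucket path and the partition count are only illustrative:

from pyspark.rdd import portable_hash

(myRdd
    .map(lambda x: ((x[0], x[2]), x))                     # key = (userId, ts)
    .repartitionAndSortWithinPartitions(
        numPartitions=5,
        partitionFunc=lambda key: portable_hash(key[0]),  # partition by userId only
        ascending=True)
    .map(lambda x: x[1])                                   # drop the key, keep the event
    .saveAsTextFile('s3n://myBucket/myKey'))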