Save sorted file
Saving files to disk sorted by some custom logic can be quite challenging when you use Spark with Python.
Dataset (userId,itemId,ts): [(3458, 12, 89614), (7186, 1228, 94614), (1399, 367, 99614), (7186, 633, 100614), (3942, 446, 101614), (3458, 402, 103614), (7186, 19, 112614), (7186, 744, 113614), (3104, 890, 116614), (3458, 366, 69403), (5182, 429, 82403), (2460, 602, 87403), (2460, 1104, 88403), (3458, 609, 95403), (2460, 246, 107403), ...]
The requested task is: "you have to split this dataset with these constraints":
- All events of each user must be in the same file
- The events of each user must be sorted by ts
If you are using Spark in a local[n] configuration, you just have to make sure that no two workers are working on the same user (otherwise you will end up with files whose lines are randomly shuffled).
Something like (I really don't like it):
def saveOnFile(x, path='/tmp/'):
    # Append one event to the file named after its userId.
    with open(path + str(x[0]), 'a') as f:
        f.write(str(x[0]) + "\t" + str(x[1]) + "\n")

myRdd.foreach(lambda x: saveOnFile(x))
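One way to take care of that is to key the RDD by userId and repartition it before writing, so that all events of a user land in the same partition and are written by exactly one task. A minimal sketch, assuming the (userId, itemId, ts) tuples above; the output directory and the partition count are only illustrative:

import os

def write_user_events(records, out_dir='/tmp/by_user'):
    # records is an iterator over (userId, event) pairs for this partition only;
    # out_dir is a hypothetical, pre-existing local directory.
    for user_id, event in records:
        with open(os.path.join(out_dir, str(user_id)), 'a') as f:
            f.write('\t'.join(str(v) for v in event) + '\n')

myRdd.map(lambda x: (x[0], x)) \
    .partitionBy(5) \
    .foreachPartition(write_user_events)

Several users may share a partition, but that is fine: each user's file is still appended to by a single task.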
If you are using Spark in cluster mode, you just have to create a pair RDD keyed by the same value you would like to use for the file splitting, and then use:
myRdd.map(lambda x: (x[0], x)) \
    .repartitionAndSortWithinPartitions(5) \
    .map(lambda x: x[1]) \
    .saveAsTextFile('s3n://myBucket/myKey')
'repartitionAndSortWithinPartitions(k)' creates k partitions and then sorts the records within each partition by key, in ascending order by default (you can supply your own key function and partition function).
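If you also want the events of each user sorted by ts, one option is to use a composite (userId, ts) key together with a partition function that looks only at the userId part, so that all events of a user end up in the same partition and are sorted by timestamp within it. A minimal sketch, assuming the (userId, itemId, ts) tuples above; the bucket path and the partition count are only illustrative:

from pyspark.rdd import portable_hash

(myRdd
    .map(lambda x: ((x[0], x[2]), x))                     # key = (userId, ts)
    .repartitionAndSortWithinPartitions(
        numPartitions=5,
        partitionFunc=lambda key: portable_hash(key[0]),  # partition by userId only
        ascending=True)
    .map(lambda x: x[1])                                   # drop the key, keep the event
    .saveAsTextFile('s3n://myBucket/myKey'))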