[SPARK-52509][CORE] Cleanup individual shuffles from fallback storage on RemoveShuffle event
### What changes were proposed in this pull request?
The shuffle data of an individual shuffle is now deleted from the fallback storage during regular shuffle cleanup, that is, on the `RemoveShuffle` event.
### Why are the changes needed?
Currently, shuffle data is removed from the fallback storage only on Spark context shutdown. Long-running Spark jobs therefore accumulate shuffle data there, even though Spark no longer uses it. Such shuffles should be cleaned up while the Spark context is still running.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests and manual test via [reproduction example](https://gist.github.com/EnricoMi/e9daa1176bce4c1211af3f3c5848112a/3140527bcbedec51ed2c571885db774c880cb941).
Run the reproduction example without the ` <<< "$scala"`. In the Spark shell, execute this code:
```scala
import org.apache.spark.sql.SaveMode
val n = 100000000
val j = spark.sparkContext.broadcast(1000)
val x = spark.range(0, n, 1, 100).select($"id".cast("int"))
x.as[Int]
.mapPartitions { it => if (it.hasNext && it.next < n / 100 * 80) Thread.sleep(2000); it }
.groupBy($"value" % 1000).as[Int, Int]
.flatMapSortedGroups($"value"){ case (m, it) => if (it.hasNext && it.next == 0) Thread.sleep(10000); it }
.write.mode(SaveMode.Overwrite).csv("/tmp/spark.csv")
```
This writes some data of shuffle 0 to the fallback storage.
Invoking `System.gc()` removes that shuffle directory from the fallback storage. Exiting the Spark shell removes the whole application directory.
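To observe the cleanup from within the same Spark shell session, the fallback storage can be listed before and after the garbage collection. The following is only a sketch: it assumes that the fallback storage path was configured via `spark.storage.decommission.fallbackStorage.path` when the shell was started, and that shuffle data is laid out as `<fallback path>/<application id>/<shuffle id>`. The helper `fallbackEntries` is a hypothetical name introduced for illustration, and the sleep duration is arbitrary.

```scala
import org.apache.hadoop.fs.Path

// Assumption: the fallback storage path was set when the Spark shell was started.
val fallbackPath = spark.sparkContext.getConf.get("spark.storage.decommission.fallbackStorage.path")

// Assumption: shuffle data is stored under <fallback path>/<application id>/<shuffle id>.
val appDir = new Path(fallbackPath, spark.sparkContext.applicationId)
val fs = appDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Hypothetical helper: names of the entries directly below the application directory.
def fallbackEntries(): Seq[String] =
  if (fs.exists(appDir)) fs.listStatus(appDir).map(_.getPath.getName).sorted.toSeq
  else Seq.empty

// Entries present after the job above has finished.
println(fallbackEntries())

// Trigger a garbage collection, then give the cleanup some time to run.
System.gc()
Thread.sleep(5000)

// Entries remaining after the cleanup.
println(fallbackEntries())
```

If the layout assumption holds, the first listing should contain an entry for shuffle 0 and the second should not. The cleanup may not happen immediately after `System.gc()` returns, so the second listing may need to be repeated.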
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #51199 from EnricoMi/fallback-storage-cleanup.
Authored-by: Enrico Minack <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>