Advanced Sparsity Estimation #987
Open
Labels
LDE winter 2025/26: Student project in the course Large-scale Data Engineering at TU Berlin (winter 2025/26).
student project: Suitable for a bachelor/master student's programming project.
Motivation: Sparse matrices, which contain mostly zero values, are commonplace in many application domains and frequently arise as intermediate results during the processing of ML and graph algorithms. By using sparse data representations and sparsity-exploiting kernels, the memory footprint and runtime of a data analysis workload can be reduced asymptotically. To automatically choose a sparse representation where beneficial, the system must know the sparsities of intermediate results, which is typically achieved through sparsity estimation at compile-time. While sparsity estimation has been an active field of research for many years, DAPHNE currently supports only the simplest form: naive metadata estimators.
Task (in C++): Augment DAPHNE's existing naive sparsity estimation with more sophisticated methods from the literature (see slides 16-19 of Matthias Boehm's lecture Architecture of ML Systems for an overview). Based on the resulting, more accurate sparsity estimates, make DAPHNE automatically select a dense or sparse (CSR) representation to optimize processing w.r.t. memory footprint and/or runtime. Show the impact of the improved sparsity estimation on the performance of typical ML and/or graph processing workloads.
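For orientation, here is a minimal sketch of a naive metadata estimator and a size-based format choice. It is not DAPHNE code: all names (`estimateMatMulSparsity`, `chooseRepr`, `Repr`) are hypothetical. It assumes sparsity is defined as the fraction of non-zero cells, that non-zeros are uniformly and independently distributed (the usual average-case assumption for matrix products), and that values are `double` with `size_t` indices in CSR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helpers for illustration only; none of these are DAPHNE APIs.

// Average-case metadata estimate for C = A * B, where A is (m x n) and
// B is (n x l). sA and sB are the fractions of non-zero cells in A and B.
// Assumes uniformly and independently distributed non-zeros:
//   sC = 1 - (1 - sA * sB)^n
double estimateMatMulSparsity(double sA, double sB, std::size_t n) {
    assert(sA >= 0.0 && sA <= 1.0 && sB >= 0.0 && sB <= 1.0);
    double sC = 1.0 - std::pow(1.0 - sA * sB, static_cast<double>(n));
    return std::min(1.0, std::max(0.0, sC));
}

// Estimated size in bytes of a dense row-major matrix of doubles.
double denseBytes(std::size_t rows, std::size_t cols) {
    return static_cast<double>(rows) * static_cast<double>(cols) * sizeof(double);
}

// Estimated size in bytes of a CSR matrix: one value and one column index
// per non-zero, plus (rows + 1) row offsets.
double csrBytes(std::size_t rows, std::size_t cols, double sparsity) {
    double nnz = sparsity * static_cast<double>(rows) * static_cast<double>(cols);
    return nnz * (sizeof(double) + sizeof(std::size_t)) +
           static_cast<double>(rows + 1) * sizeof(std::size_t);
}

enum class Repr { Dense, CSR };

// Pick the representation with the smaller estimated memory footprint.
// A real decision would likely also consider kernel runtime.
Repr chooseRepr(std::size_t rows, std::size_t cols, double sparsity) {
    return csrBytes(rows, cols, sparsity) < denseBytes(rows, cols) ? Repr::CSR
                                                                   : Repr::Dense;
}
```

For example, `chooseRepr(m, l, estimateMatMulSparsity(sA, sB, n))` would pick a format for the result of a matrix product from the input metadata alone. The more sophisticated estimators from the literature aim to improve on this estimate when the uniformity assumption does not hold.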
Hints:
coming soon