imputeTestbench for multivariate time series
Data cleaning remains one of the most critical and time-consuming steps in Data Science and Data Analytics. Over the years, numerous methods have been proposed for data cleaning processes such as imputation, outlier detection, formatting, and visualization. Evaluating these methods rigorously for large and complex datasets is a major challenge. Tools like the cleanTS R package (Publication) have made significant strides by automating some of these steps.
One of the core dependencies of cleanTS is the imputeTestbench package (Publication), which automates performance evaluation and comparison of imputation methods for time series data. Currently, imputeTestbench primarily supports univariate time series imputation and lacks robust capabilities for multivariate or large-scale datasets.
In last year's proposal, we introduced plans to extend imputeTestbench to handle multivariate time series efficiently. Now, we aim to incorporate additional updates and advanced features for improved computational performance and broader applicability. This will include features such as parallelization, integration with modern HPC frameworks, and advanced data structures for time series.
The imputeTestbench package builds on AutoML concepts for time series imputation: it generates missing-data patterns automatically and evaluates multiple methods simultaneously. Our proposed updates will make imputeTestbench compatible with multivariate datasets and with complementary packages like cleanTS, enabling streamlined, end-to-end time series cleaning.
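As a rough illustration of the benchmarking idea described above, the sketch below removes values at random from a small synthetic two-variable series, fills them with two simple imputation functions, and scores each by RMSE on the removed cells. It uses base R only. The names `impute_mean`, `impute_locf`, and `benchmark_imputation` are hypothetical and chosen for illustration; they are not part of the imputeTestbench API.

```r
# Minimal sketch (base R only). All function names are hypothetical,
# not part of imputeTestbench.

# Replace NAs in each column with that column's mean.
impute_mean <- function(x) {
  apply(x, 2, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
}

# Last observation carried forward, per column. Leading NAs fall back
# to the first observed value.
impute_locf <- function(x) {
  apply(x, 2, function(col) {
    obs <- which(!is.na(col))
    # For each position, count of observed indices at or before it.
    idx <- findInterval(seq_along(col), obs)
    idx[idx == 0] <- 1
    col[obs[idx]]
  })
}

# Remove a fraction of cells at random, impute with each method,
# and return the RMSE of each method on the removed cells.
benchmark_imputation <- function(x, methods, missing_frac = 0.1, seed = 1) {
  set.seed(seed)
  n_cells <- length(x)
  holes <- sample(n_cells, size = floor(missing_frac * n_cells))
  x_missing <- x
  x_missing[holes] <- NA
  sapply(methods, function(f) {
    x_hat <- f(x_missing)
    sqrt(mean((x_hat[holes] - x[holes])^2))
  })
}

# Synthetic two-variable series (assumed data, for illustration only).
set.seed(42)
n <- 200
time_idx <- seq_len(n)
series <- cbind(
  temperature = sin(2 * pi * time_idx / 50) + rnorm(n, sd = 0.1),
  humidity    = cos(2 * pi * time_idx / 50) + rnorm(n, sd = 0.1)
)

rmse <- benchmark_imputation(
  series,
  methods = list(mean = impute_mean, locf = impute_locf),
  missing_frac = 0.1
)
print(rmse)
```

Both methods here treat each column independently; a multivariate-aware method would slot into the same `methods` list as another function that takes and returns the full matrix.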
The goal of this project is to evolve imputeTestbench into a robust, multivariate-ready, and high-performance R package. Key tasks include:
- Multivariate extension:
  - Adapt the existing tool to handle multivariate time series from multiple domains (e.g., environmental, financial, sensor data).
  - Introduce flexible imputation pipelines that support correlated series, ensuring consistent missing value treatment across multiple variables.
- Enhanced performance & HPC integration:
  - Migrate data structures to `data.table` or equivalent to handle large-scale time series data more efficiently.
  - Integrate parallel computing solutions such as `future`, `foreach`, or HPC backends to reduce computation time.
  - Optionally explore solutions with Apache Spark or distributed computing frameworks for extremely large datasets.
- Advanced imputation methods:
  - Embed modern and state-of-the-art time series imputation techniques (e.g., machine-learning-driven approaches, deep learning methods, or advanced statistical models).
  - Offer a plugin interface or a bridge to Python libraries (using `reticulate`) for specialized imputation methods.
- Shiny dashboard and improved user interface:
  - Develop or refine a Shiny-based dashboard for an interactive user experience.
  - Provide real-time visualization of imputation methods’ results, performance comparisons, and parameter tuning.
- Extensive testing & documentation:
  - Ensure robust unit testing using `testthat` with high code coverage.
  - Provide comprehensive vignettes illustrating how to use the new features, including step-by-step workflows for multivariate data.
  - Document HPC integration details and recommended configurations.
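The parallel-computing task in the list above could be prototyped along the following lines. This sketch uses the `parallel` package that ships with base R as a stand-in for `future` or `foreach`. The `score_run` helper, the synthetic series, and the grid of missing-data fractions are assumptions made for illustration and are not part of imputeTestbench.

```r
library(parallel)

# Hypothetical helper: remove a fraction of values from a numeric vector,
# fill them by linear interpolation, and return the RMSE on the removed points.
score_run <- function(missing_frac, series, seed) {
  set.seed(seed)
  n <- length(series)
  # Keep both endpoints observed so interpolation is defined everywhere.
  holes <- sample(2:(n - 1), size = floor(missing_frac * n))
  series_missing <- series
  series_missing[holes] <- NA
  obs <- which(!is.na(series_missing))
  series_hat <- approx(obs, series_missing[obs], xout = seq_len(n))$y
  sqrt(mean((series_hat[holes] - series[holes])^2))
}

# Synthetic series and an assumed grid of missing-data fractions.
series <- sin(2 * pi * seq_len(500) / 50)
fracs <- c(0.1, 0.2, 0.3, 0.4)

# Evaluate each fraction on a separate worker; always release the cluster.
cl <- makeCluster(2)
rmse_by_frac <- tryCatch(
  parSapply(cl, fracs, score_run, series = series, seed = 123),
  finally = stopCluster(cl)
)
names(rmse_by_frac) <- paste0(fracs * 100, "%")
print(rmse_by_frac)
```

Each worker receives the full series and a fixed seed, so the runs are independent and reproducible; the same pattern extends to a grid over methods or over variables of a multivariate dataset.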
With these updates, imputeTestbench will transform into a more powerful AutoML tool for time series imputation. It will support multi-dimensional datasets, modern HPC frameworks, and advanced data structures, making it easier for researchers and practitioners to:
- Rapidly benchmark multiple imputation methods on large or complex datasets.
- Scale computations across local multicore systems or distributed HPC environments.
- Generate reproducible workflows using robust testing and improved documentation.
Ultimately, this will accelerate and standardize time series imputation in domains such as finance, environmental monitoring, sensor networks, and more.
- Aditya Gupta (Evaluating Mentor): Postdoc at the University of Agder, Norway. Has a Ph.D. in data-driven solutions for environmental issues, currently working on AI for Sustainable Aquaculture. [email protected]
- Neeraj Dhanraj Bokde: Senior Researcher at Technology Innovation Institute, Abu Dhabi, and former Assistant Professor at Aarhus University, Denmark. [email protected]. Has a Ph.D. in Data Science and has contributed to multiple R packages on time series analysis.
Students, please do one or more of the following tests before contacting the mentors:
- Easy: Download the imputeTestbench package and demonstrate it on a naturally occurring time series. Document it with RMarkdown.
- Medium: Suggest a new feature or enhancement you would like to see in the next version of the imputeTestbench package.
- Hard: Develop a dummy implementation of five new functions and create a vignette, ensuring it passes with no Error/Warning/Note via https://win-builder.r-project.org/.
Contributors, please post a link to your test results below:
Contributor Name | GitHub Profile | Test Results |
---|---|---|
Avinab Neogy | avinabneogy23 | Solution |
PRIYANSHU | tech0priyanshu | Solution |
Mayank Yadav | Mayank85Y | Solution |
Jiayi Qian - GitHub link