
imputeTestbench for multivariate time series

Mayank85Y edited this page Mar 14, 2025 · 4 revisions

Background

Data cleaning remains one of the most critical and time-consuming steps in Data Science and Data Analytics. Over the years, numerous methods have been proposed for data cleaning processes such as imputation, outlier detection, formatting, and visualization. Evaluating these methods rigorously for large and complex datasets is a major challenge. Tools like the cleanTS R package (Publication) have made significant strides by automating some of these steps.

One of the core dependencies of cleanTS is the imputeTestbench package (Publication). This package automates performance evaluation and comparison of various imputation methods for time series data. Currently, imputeTestbench primarily supports univariate time series imputation and lacks robust capabilities for multivariate or large-scale datasets.

In last year's proposal, we introduced plans to extend imputeTestbench to handle multivariate time series efficiently. Now, we aim to add further updates for improved computational performance and broader applicability, including parallelization, integration with modern HPC frameworks, and advanced data structures for time series.

Related work

The imputeTestbench package builds on AutoML concepts for time series imputation: it generates missing-data patterns automatically and evaluates multiple imputation methods side by side. Our proposed updates will make imputeTestbench compatible with multivariate datasets and with complementary packages like cleanTS, enabling streamlined, end-to-end time series cleaning.
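
The benchmarking idea described above (remove known values, impute them, score the result against the truth) can be sketched as follows. The sketch is written in Python purely for illustration and is not imputeTestbench's API: the function names (`mask_random`, `benchmark`, and so on), the three toy imputers, the sine test series, and the choice of RMSE as the error metric are all assumptions made for this example.

```python
import math
import random

def mask_random(series, fraction, rng):
    """Copy `series`, set a random share of interior points to None, return (copy, indices)."""
    n = len(series)
    k = max(1, int(round(fraction * (n - 2))))
    idx = sorted(rng.sample(range(1, n - 1), k))  # both endpoints stay observed
    masked = list(series)
    for i in idx:
        masked[i] = None
    return masked, idx

def impute_mean(series):
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def impute_locf(series):
    """Last observation carried forward."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def impute_linear(series):
    """Linear interpolation between neighbouring observed points."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out

def rmse(truth, estimate, idx):
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in idx) / len(idx))

def benchmark(series, methods, fractions, repetitions=20, seed=0):
    """Mean RMSE per (method, missing fraction); every method sees the same masks."""
    rng = random.Random(seed)
    results = {}
    for fraction in fractions:
        totals = {name: 0.0 for name in methods}
        for _ in range(repetitions):
            masked, idx = mask_random(series, fraction, rng)
            for name, method in methods.items():
                totals[name] += rmse(series, method(masked), idx)
        for name, total in totals.items():
            results[(name, fraction)] = total / repetitions
    return results

# Hypothetical method table and a synthetic series, used only for this example.
METHODS = {"mean": impute_mean, "locf": impute_locf, "linear": impute_linear}
series = [math.sin(t / 5.0) for t in range(100)]
scores = benchmark(series, METHODS, fractions=[0.1, 0.3])
for (name, fraction), error in sorted(scores.items()):
    print(f"{name:>6} @ {fraction:.0%} missing: RMSE = {error:.4f}")
```

Within each repetition, every method is scored on the same mask, so differences in the reported RMSE reflect the methods rather than the random draw.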

Details of your coding project

The goal of this project is to evolve imputeTestbench into a robust, multivariate-ready, and high-performance R package. Key tasks include:

  1. Multivariate extension:

    • Adapt the existing tool to handle multivariate time series from multiple domains (e.g., environmental, financial, sensor data).
    • Introduce flexible imputation pipelines that support correlated series, ensuring consistent missing value treatment across multiple variables.
  2. Enhanced performance & HPC integration:

    • Migrate data structures to data.table or equivalent to handle large-scale time series data more efficiently.
    • Integrate parallel computing solutions such as future, foreach, or HPC backends to reduce computation time.
    • Optionally explore Apache Spark or other distributed computing frameworks for extremely large datasets.
  3. Advanced imputation methods:

    • Embed modern and state-of-the-art time series imputation techniques (e.g., machine-learning-driven approaches, deep learning methods, or advanced statistical models).
    • Offer a plugin interface, or a bridge to Python libraries (using reticulate), for specialized imputation methods.
  4. Shiny dashboard and improved user interface:

    • Develop or refine a Shiny-based dashboard for an interactive user experience.
    • Provide real-time visualization of imputation methods’ results, performance comparisons, and parameter tuning.
  5. Extensive testing & documentation:

    • Ensure robust unit testing using testthat with high code coverage.
    • Provide comprehensive vignettes illustrating how to use the new features, including step-by-step workflows for multivariate data.
    • Document HPC integration details and recommended configurations.
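
Task 1 above calls for imputation pipelines that exploit correlation between series. As a minimal sketch of why that matters, the example below fills gaps in one variable in two ways: by its own column mean, and by a least-squares regression on a second, correlated variable. It is written in Python purely for illustration. The function names, the two variable names, and the synthetic data are all hypothetical, and regression on a single predictor is a simple stand-in technique, not a method the proposal commits to.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x on paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def impute_column_mean(data, target):
    """Univariate baseline: ignore every other variable."""
    col = data[target]
    observed = [v for v in col if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in col]

def impute_from_predictor(data, target, predictor):
    """Fill gaps in `target` by regressing it on `predictor` where both are observed."""
    tcol, pcol = data[target], data[predictor]
    pairs = [(p, t) for p, t in zip(pcol, tcol) if p is not None and t is not None]
    a, b = fit_line([p for p, _ in pairs], [t for _, t in pairs])
    fallback = impute_column_mean(data, target)  # used if the predictor is missing too
    return [
        t if t is not None else (a + b * p if p is not None else fallback[i])
        for i, (t, p) in enumerate(zip(tcol, pcol))
    ]

def rmse(truth, estimate, idx):
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in idx) / len(idx))

# Synthetic, made-up data: two correlated variables on the same time steps.
rng = random.Random(42)
temperature = [20 + 5 * math.sin(t / 8.0) for t in range(200)]
humidity = [80 - 2 * temp + rng.gauss(0, 0.5) for temp in temperature]
truth = {"temperature": temperature, "humidity": humidity}

# Remove 20% of the humidity readings; temperature stays fully observed.
missing = sorted(rng.sample(range(200), 40))
masked = {name: list(col) for name, col in truth.items()}
for i in missing:
    masked["humidity"][i] = None

errors = {
    "column mean": rmse(humidity, impute_column_mean(masked, "humidity"), missing),
    "regression on temperature": rmse(
        humidity, impute_from_predictor(masked, "humidity", "temperature"), missing
    ),
}
for name, error in errors.items():
    print(f"{name}: RMSE = {error:.3f}")
```

A multivariate testbench would report errors like these per variable, so that methods using cross-variable information can be compared directly with univariate baselines.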

Expected impact

With these updates, imputeTestbench will transform into a more powerful AutoML tool for time series imputation. It will support multi-dimensional datasets, modern HPC frameworks, and advanced data structures, making it easier for researchers and practitioners to:

  • Rapidly benchmark multiple imputation methods on large or complex datasets.
  • Scale computations across local multicore systems or distributed HPC environments.
  • Generate reproducible workflows using robust testing and improved documentation.

Ultimately, this will accelerate and standardize time series imputation in domains such as finance, environmental monitoring, sensor networks, and more.
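
Scaling across cores and keeping results reproducible pull against each other unless each unit of work carries its own random seed. The sketch below shows one way to arrange that, again in Python for illustration only. The task layout, the `run_task` name, and the synthetic series are assumptions. A thread pool is used just to keep the example self-contained: in CPython, threads generally do not speed up CPU-bound code, so a real implementation would use process-based workers or, in R, backends such as the future or foreach packages named in the proposal.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def impute_linear(series):
    """Linear interpolation between neighbouring observed points."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out

def run_task(task):
    """One benchmark cell: mask with a task-specific seed, impute, return the RMSE."""
    series, fraction, seed = task
    rng = random.Random(seed)  # depends only on the task, never on scheduling order
    n = len(series)
    idx = rng.sample(range(1, n - 1), max(1, round(fraction * (n - 2))))
    masked = list(series)
    for i in idx:
        masked[i] = None
    filled = impute_linear(masked)
    return math.sqrt(sum((series[i] - filled[i]) ** 2 for i in idx) / len(idx))

# Synthetic series and a grid of (missing fraction, seed) cells, made up for this example.
series = [math.sin(t / 5.0) for t in range(300)]
tasks = [(series, fraction, seed) for fraction in (0.1, 0.3, 0.5) for seed in range(8)]

serial = [run_task(task) for task in tasks]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(run_task, tasks))  # map returns results in task order

print("parallel run matches serial run:", parallel == serial)
```

Because each task derives its mask from its own seed, the benchmark gives the same numbers whether it runs on one worker or many.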

Mentors

  • Aditya Gupta (Evaluating Mentor): Postdoc at the University of Agder, Norway. Holds a Ph.D. in data-driven solutions for environmental issues and currently works on AI for Sustainable Aquaculture. [email protected]

  • Neeraj Dhanraj Bokde: Senior Researcher at Technology Innovation Institute, Abu Dhabi, and former Assistant Professor at Aarhus University, Denmark. Has a Ph.D. in Data Science and has contributed to multiple R packages on time series analysis. [email protected]

Tests

Students, please do one or more of the following tests before contacting the mentors:

  • Easy: Download the imputeTestbench package and demonstrate it on a naturally occurring time series. Document it with RMarkdown.

  • Medium: Suggest a new feature or enhancement you would like to see in the next version of the imputeTestbench package.

  • Hard: Develop a dummy implementation of five new functions and create a vignette, ensuring the package checks cleanly with no ERROR, WARNING, or NOTE on https://win-builder.r-project.org/.

Solutions of tests

Contributors, please post a link to your test results below:

| Contributor Name | GitHub Profile | Test Results |
| --- | --- | --- |
| Avinab Neogy | avinabneogy23 | Solution |
| PRIYANSHU | tech0priyanshu | Solution |
| Mayank Yadav | Mayank85Y | Solution |