TiViT: Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
Authors: Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain.
Illustration of TiViT on a time series sample from ECG200. We split the time series into segments and stack them to form a grayscale image. Then, we patch the image in 2D and feed it into a frozen ViT pretrained on large-scale image datasets. We average the hidden representations from a specific layer and pass them to a learnable classification head. Combining the representations of TiViT and TSFMs such as Mantis further improves classification accuracy.
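For intuition, here is a minimal sketch (not the repository's code) of how a 1D series can be turned into a grayscale image of stacked segments. The √T segment length and the relative stride mirror the `--patch_size sqrt --stride 0.1` settings used in the commands below, but the exact windowing in `main.py` may differ, and the helper name is hypothetical.

```python
import numpy as np

def series_to_image(x: np.ndarray, seg_len: int | None = None, stride_frac: float = 0.1) -> np.ndarray:
    """Stack overlapping segments of a 1D series into a 2D grayscale image
    (illustrative helper, not the repository's implementation)."""
    T = len(x)
    if seg_len is None:
        seg_len = int(np.sqrt(T))                 # cf. --patch_size sqrt
    stride = max(1, int(stride_frac * seg_len))   # stride as a fraction of the segment length (assumption)
    rows = np.stack([x[s:s + seg_len] for s in range(0, T - seg_len + 1, stride)])
    # Min-max normalize to [0, 255] so the stack can be treated as a grayscale image.
    rows = (rows - rows.min()) / (rows.max() - rows.min() + 1e-8)
    return (rows * 255).astype(np.uint8)

# Example: a synthetic series of length 96 (the length of ECG200 samples).
x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.1 * np.random.randn(96)
img = series_to_image(x)
print(img.shape)  # the image is then resized to the ViT input resolution and patched in 2D
```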
This repository works with Python 3.11 and PyTorch 2.7. Please create a conda environment and install the dependencies specified in `requirements.txt`.
```bash
conda create -n tivit_env python=3.11
conda activate tivit_env
python -m pip install -r requirements.txt
```
We evaluate TiViT on the UCR and UEA benchmarks for time series classification. The benchmark datasets can be loaded via the aeon toolkit. For UCR datasets, we apply linear interpolation by default to handle missing values and use padding to accommodate unequal time series lengths. To use the preprocessing functionality provided by aeon instead, add the `--aeon` flag.
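The snippet below is a hedged sketch of this default preprocessing (aeon's `load_classification` loader, linear interpolation over NaNs, zero-padding to a common length); the helper functions are illustrative and not part of the repository.

```python
import numpy as np
from aeon.datasets import load_classification  # aeon toolkit loader

def interpolate_nans(x: np.ndarray) -> np.ndarray:
    """Linearly interpolate missing values in a 1D series (illustrative helper)."""
    nans = np.isnan(x)
    if nans.any():
        idx = np.arange(len(x))
        x = x.copy()
        x[nans] = np.interp(idx[nans], idx[~nans], x[~nans])
    return x

def pad_to_length(x: np.ndarray, length: int) -> np.ndarray:
    """Right-pad a 1D series with zeros to a fixed length (illustrative helper)."""
    return np.pad(x, (0, length - len(x)))

# Load a UCR dataset (downloaded on first use); unequal-length collections come back as a list.
X, y = load_classification("ECG200")
series = [interpolate_nans(np.asarray(s, dtype=float).ravel()) for s in X]
max_len = max(len(s) for s in series)
X_clean = np.stack([pad_to_length(s, max_len) for s in series])
print(X_clean.shape, len(y))
```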
In our study, we evaluate TiViT with various ViT backbones, as summarized in the following table. These ViTs cover a wide range of pretraining paradigms and model sizes in the vision and vision-language domain.
Model | Paper | Checkpoints |
---|---|---|
CLIP | Radford et al., 2021; Cherti et al., 2022 | ViT-B-16, ViT-B-32, ViT-L-14, ViT-H-14 |
SigLIP 2 | Tschannen et al., 2025 | SoViT-400m-14 |
DINOv2 | Oquab et al., 2023 | ViT-S-14, ViT-B-14, ViT-L-14 |
MAE | He et al., 2021 | ViT-B-16, ViT-L-14, ViT-H-14 |
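As a rough illustration of how hidden representations can be read out of one of these frozen backbones, the following sketch loads the vision tower of the OpenCLIP ViT-H checkpoint via Hugging Face `transformers`. The layer index (14) and the mean aggregation mirror the commands below, but the repository's own feature extraction may differ in detail; `images` stands for the grayscale time series images, which the processor converts to the expected RGB input.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen vision tower of OpenCLIP ViT-H/14; hidden states are exposed per layer.
model_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPVisionModel.from_pretrained(model_name).eval()
processor = CLIPImageProcessor.from_pretrained(model_name)

@torch.no_grad()
def layer_features(images, layer: int = 14) -> torch.Tensor:
    """Mean-pool the tokens of a chosen hidden layer (illustrative helper)."""
    inputs = processor(images=images, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    tokens = outputs.hidden_states[layer]   # (batch, num_tokens, dim)
    return tokens.mean(dim=1)               # mean aggregation over tokens
```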
We compare and optionally fuse TiViT with two state-of-the-art TSFMs pretrained exclusively on time series:
- Mantis (Paper, Checkpoint): a transformer model (8M parameters) pretrained on 2 million time series with contrastive learning
- MOMENT (Paper, Checkpoint): transformer models pretrained on 13 million time series with masked modeling, available in three sizes: Small (40M parameters), Base (125M parameters), and Large (385M parameters)
Below is an example command to perform classification with TiViT on the UCR benchmark using hidden representations from layer 14 of an OpenCLIP ViT-H model:
```bash
python main.py --vit_1_name laion/CLIP-ViT-H-14-laion2B-s32B-b79K --vit_1_layer 14 --aggregation mean --patch_size sqrt --stride 0.1 --classifier_type logistic_regression --datasets ucr --data_dir /path/to/your/data --result_dir /path/to/save/results --random_seed 2025
```
Check out `arguments.py` for a comprehensive overview of all configurable parameters.
To further improve classification accuracy, we propose to concatenate the representations of TiViT and traditional TSFMs by adding:
```bash
--mantis
--moment [small | base | large]
```
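Conceptually, this fusion concatenates the per-sample feature vectors of TiViT and a TSFM before fitting the linear classifier. Below is a minimal scikit-learn sketch, assuming the feature matrices have already been extracted (the variable names are placeholders, not the repository's API).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fuse_and_classify(tivit_train, tsfm_train, y_train, tivit_test, tsfm_test, y_test):
    """Concatenate TiViT and TSFM features, then fit a linear probe (illustrative)."""
    X_train = np.concatenate([tivit_train, tsfm_train], axis=1)
    X_test = np.concatenate([tivit_test, tsfm_test], axis=1)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```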
The following table summarizes the linear classification accuracy of TiViT and TSFMs on the UCR and UEA benchmarks.
Model | UCR | UEA |
---|---|---|
MOMENT | 79.0 | 69.9 |
Mantis | 80.1 | 72.4 |
TiViT (Ours) | 81.3 | 72.0 |
TiViT + MOMENT (Ours) | 82.5 | 72.6 |
TiViT + Mantis (Ours) | 83.0 | 73.7 |
To better understand the hidden representations of ViTs, we analyze how their structure evolves across layers. Use the following command to compute the intrinsic dimension of representations from CLIP ViT-H:
```bash
python main.py --vit_1_name laion/CLIP-ViT-H-14-laion2B-s32B-b79K --aggregation mean --patch_size sqrt --stride 0.1 --get_intrinsic_dimension --datasets ucr --data_dir /path/to/your/data --result_dir /path/to/save/results --random_seed 2025
```
Alternatively, you may compute the number of principal components needed to explain 95% of the variance in the representations by setting the `--get_principal_components` flag.
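For reference, both quantities are straightforward to compute from a feature matrix: the 95%-variance count with scikit-learn's PCA, and the intrinsic dimension with DADApy's TwoNN estimator (credited below). This is a sketch under the assumption that a TwoNN-style estimator is used; the repository's exact settings may differ.

```python
import numpy as np
from dadapy import Data  # DADApy: intrinsic-dimension estimation
from sklearn.decomposition import PCA

def pcs_for_95_variance(features: np.ndarray) -> int:
    """Number of principal components needed to explain 95% of the variance."""
    cum = np.cumsum(PCA().fit(features).explained_variance_ratio_)
    return int(np.searchsorted(cum, 0.95) + 1)

def intrinsic_dimension(features: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate of a feature matrix (illustrative)."""
    data = Data(features.astype(np.float64))
    data.compute_id_2NN()
    return float(data.intrinsic_dim)
```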
The alignment of representations from CLIP ViT-H (layer 14) and Mantis on the UCR benchmark can be measured using the mutual k-NN metric.
```bash
python main.py --vit_1_name laion/CLIP-ViT-H-14-laion2B-s32B-b79K --vit_1_layer 14 --aggregation mean --patch_size sqrt --stride 0.1 --mantis --measure_alignment --datasets ucr --data_dir /path/to/your/data --result_dir /path/to/save/results --random_seed 2025
```
You may specify different ViTs (`--vit_1_name`, `--vit_2_name`) and/or TSFMs (`--mantis`, `--moment`). Note that the alignment score can only be computed between two models at a time.
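The mutual k-NN metric itself (from the Platonic Representation Hypothesis code acknowledged below) is the average overlap between each sample's k nearest neighbors in the two representation spaces. A minimal sketch, assuming Euclidean neighbors and k=10; the reference implementation may use a different distance and neighborhood size.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Average overlap of k-nearest-neighbor sets computed in two feature spaces."""
    def knn_indices(feats):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
        idx = nn.kneighbors(feats, return_distance=False)
        return idx[:, 1:]  # drop each point itself
    idx_a, idx_b = knn_indices(feats_a), knn_indices(feats_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)]
    return float(np.mean(overlaps))
```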
We appreciate the following repositories for their valuable code:
- The Platonic Representation Hypothesis: for assessing model alignment via the mutual k-NN metric
- DADApy: for computing the intrinsic dimension of representations
If you find TiViT useful, please star this repository and cite our work:
```bibtex
@article{roschmann2025tivit,
  title={Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers},
  author={Simon Roschmann and Quentin Bouniot and Vasilii Feofanov and Ievgen Redko and Zeynep Akata},
  journal={arXiv preprint arXiv:2506.08641},
  year={2025}
}
```
If you have any questions, feel free to contact us: [email protected]