A conditional latent diffusion model for generating realistic 24-hour temperature sequences over northern Italy. This project implements a state-of-the-art deep learning pipeline that combines a Variational Autoencoder (VAE) with a diffusion model to generate physically consistent temperature maps.
This work is part of the HMMA project, funded by ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing.
Comparison of 24-hour temperature sequences across all months. First row: real reanalysis data. Following rows: model-generated sequences. Each column represents a different month.
- Conditional Generation: Month-based conditioning for seasonal variation (12 classes)
- Latent Diffusion: Generation in compressed latent space
- DDIM Sampling: Fast sampling (100 steps) with deterministic/stochastic options
- Physically-Based Evaluation: Metrics for extreme events, spatial extent, temporal persistence
- HPC Training: Trained on Leonardo supercomputer using data parallelism (2 nodes × 4 GPUs per node)
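For reference, the deterministic DDIM update (eta = 0) behind this kind of fast sampler can be sketched as below. This is a generic illustration, not the project's `sampling/ddim.py`; the function name and schedule values are placeholders:

```python
import numpy as np

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t:         current noisy latent
    eps:         the denoiser's noise prediction at this step
    alpha_bar_*: cumulative noise-schedule products at the two timesteps
    """
    # Predict x_0 from the current latent and the noise estimate...
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # ...then move it deterministically to the previous noise level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps
```

Repeating this update over a short timestep subsequence (100 steps here) walks a pure-noise latent down to a clean sample; the stochastic variant adds schedule-scaled noise at each step.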
Both the VAE and diffusion models were trained on the Leonardo supercomputer (hosted and managed by CINECA) using:
- Hardware: 2 nodes with 4 GPUs each (8 GPUs total)
- Strategy: Data parallelism with PyTorch Lightning DDP
- Precision: Mixed precision (16-bit) training
- Data: High-resolution temperature data from VHR-REA IT (COSMO_CLM dynamical downscaling of ERA5) over northern Italy (1981-2018 training, 2019 validation)
- Grid: 128 × 256 spatial resolution (2.2 km grid size), 24-hour temporal sequences
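Putting those numbers together: each training sample is a 24-frame sequence on the 128 × 256 grid, which the VAE compresses 16× spatially before diffusion. A quick shape sketch (the channel counts are illustrative, not taken from the actual config):

```python
# Shapes implied by the data description (channel counts are illustrative).
T, H, W = 24, 128, 256       # 24-hour sequence on the 128 x 256 grid
factor = 16                  # VAE spatial compression factor

pixel_shape = (T, 1, H, W)                        # one temperature channel
latent_shape = (T, 4, H // factor, W // factor)   # e.g. 4 latent channels
print(latent_shape)  # (24, 4, 8, 16)
```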
# Clone repository
git clone https://github.com/yourusername/temperature-weather-generator.git
cd temperature-weather-generator
# Create virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install dependencies
pip install -r requirements.txt

# Clone repository
git clone https://github.com/yourusername/temperature-weather-generator.git
cd temperature-weather-generator
# Create conda environment
conda create -n tempgen python=3.9
conda activate tempgen
# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Install dependencies
pip install -r requirements.txt

All data and pretrained models are available for download:
After downloading from SharePoint, you'll have:
data/
├── data_anomaly_preprocessed_percentile/ # Preprocessed anomaly data (ready for training)
├── nc/ # Generated sequences in NetCDF format
├── npy/ # Generated sequences in NumPy format
├── VH_REA_1981_2020_cropER/ # Raw VHR-REA IT data (cropped)
└── *.npy # Climatology statistics files
checkpoints/
├── generator/ # Diffusion model checkpoints
│ ├── last.ckpt # Last training checkpoint
│ └── best.ckpt # Best validation loss checkpoint
└── vae/ # VAE model checkpoints
├── last.ckpt # Last training checkpoint
└── best.ckpt # Best validation loss checkpoint
Place these folders in the repository root directory.
The dataset is derived from VHR-REA IT, a high-resolution (2.2-km) reanalysis produced by dynamical downscaling of ERA5 using COSMO_CLM over Italy.
The raw VHR-REA IT data is available at: CMCC Data Delivery System
The dataset includes:
- Preprocessed 2-meter temperature anomalies (1981-2020)
- Climatology statistics (12 months × 24 hours)
- Training data: 1981-2018
- Validation data: 2019
- Test data: 2020
- Grid: 128 × 256 spatial resolution (2.2 km grid size) over northern Italy
- VAE checkpoint: Trained autoencoder for latent space compression (16× spatial compression)
- Diffusion checkpoint: Conditional diffusion model with month conditioning (12 classes)
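As a rough sketch of how per-(month, hour) climatology statistics like these are typically applied, the standardization and its inverse might look as follows. The array names, and the assumption that a per-cell standard deviation is stored alongside the mean, are illustrative, not the actual `prepare_data.py` code:

```python
import numpy as np

# Hypothetical climatology arrays: per-(month, hour) mean/std over the grid.
clim_mean = np.zeros((12, 24, 128, 256), dtype=np.float32)
clim_std = np.ones((12, 24, 128, 256), dtype=np.float32)

def to_anomaly(t2m_day, month_idx):
    """Standardize one (24, 128, 256) daily sequence; month_idx in 0..11."""
    return (t2m_day - clim_mean[month_idx]) / clim_std[month_idx]

def from_anomaly(anom_day, month_idx):
    """Invert the standardization to recover absolute temperatures."""
    return anom_day * clim_std[month_idx] + clim_mean[month_idx]
```

The inverse transform is what turns generated anomaly sequences back into physical temperature fields for the NetCDF/NumPy outputs.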
# Edit base_path in scripts/prepare_data.py (line ~771)
python scripts/prepare_data.py

# Configure paths in configs/config.yaml
python scripts/train_vae.py

# Requires trained VAE checkpoint
python scripts/train_diffusion.py

# Update checkpoint paths in scripts/generate.py (lines ~992-994)
python scripts/generate.py

# Update paths in scripts/evaluate.py (lines ~1021-1024)
python scripts/evaluate.py

SLURM batch scripts are provided for running on the Leonardo supercomputer at CINECA. These scripts are configured for multi-node, multi-GPU training.
| Script | Purpose | Resources |
|---|---|---|
| prepare_data.sh | Data preprocessing | 1 node, CPU only |
| train_vae.sh | Train VAE model | 2 nodes, 4 GPUs each |
| train_diffusion.sh | Train diffusion model | 2 nodes, 4 GPUs each |
| generate.sh | Generate samples | 1 node, 1 GPU |
| evaluate.sh | Evaluate results | 1 node, CPU only |
# 1. Load required modules
module load profile/deeplrn
module load cineca-ai/4.3.0
# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# 3. Install additional dependencies (if needed)
pip install -r requirements.txt

# Data preprocessing
sbatch prepare_data.sh
# Train VAE (after data is ready)
sbatch train_vae.sh
# Train diffusion model (after VAE is trained)
sbatch train_diffusion.sh
# Generate samples (after diffusion model is trained)
sbatch generate.sh
# Evaluate results
sbatch evaluate.sh

Before submitting jobs, update the following in each .sh file:
- `--account`: Your CINECA account code
- `--time`: Adjust the time limit based on your needs
- `--qos`: Quality of service (use `boost_qos_dbg` for testing, `normal` for production)
- Virtual environment path, if different from `venv/`
The model was evaluated on 1024 generated samples compared against reanalysis data. The evaluation focuses on phenomenon-based metrics that assess the physical realism of generated temperature anomalies.
Key Finding: The model uses weak conditioning (month-based categorical labels only) but successfully reproduces the overall data distribution, with generated values slightly biased towards the mean. This is expected behavior for diffusion models and does not indicate poor quality - rather, it reflects the probabilistic nature of the generation process.
The generated distribution closely matches the reanalysis distribution in the central region (within ±1σ), with reduced variance in the tails:
The Q-Q plot shows that generated quantiles fall within the ±0.3σ tolerance band across most of the distribution, with slight compression at the extremes:
| Percentile | Generated (σ) | Reanalysis (σ) | Difference |
|---|---|---|---|
| 0.1% | -1.63 | -2.05 | +0.42 |
| 1.0% | -1.07 | -1.62 | +0.55 |
| 50.0% | +0.08 | +0.03 | +0.05 |
| 99.0% | +1.17 | +1.54 | -0.37 |
| 99.9% | +1.61 | +2.00 | -0.39 |
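A percentile comparison like the table above can be reproduced with a simple quantile check. This is a sketch, not the project's `evaluate.py`; `gen` and `rea` stand for flattened arrays of generated and reanalysis anomalies (in units of sigma):

```python
import numpy as np

def qq_table(gen, rea, percentiles=(0.1, 1.0, 50.0, 99.0, 99.9)):
    """Side-by-side quantiles of generated vs. reanalysis values."""
    return [(p, float(np.percentile(gen, p)), float(np.percentile(rea, p)))
            for p in percentiles]

# Toy usage with synthetic standard-normal "data":
rng = np.random.default_rng(0)
rows = qq_table(rng.standard_normal(100_000), rng.standard_normal(100_000))
```

Plotting the second column against the third over a dense percentile grid gives the Q-Q plot described above.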
The model generates fewer extreme events compared to reanalysis, which is consistent with the slight mean-regression behavior:
| Event Type | Threshold | Generated | Reanalysis | Ratio |
|---|---|---|---|---|
| Mild warm | +0.5σ | 17.43% | 25.46% | 0.68 |
| Warm anomaly | +1.0σ | 2.29% | 7.69% | 0.30 |
| Heatwave | +1.5σ | 0.18% | 1.18% | 0.15 |
| Extreme heat | +2.0σ | 0.01% | 0.10% | 0.14 |
| Record heat | +2.5σ | 0.00% | 0.01% | 0.25 |
| Cold anomaly | -1.0σ | 1.35% | 8.26% | 0.16 |
| Cold spell | -1.5σ | 0.17% | 1.59% | 0.10 |
| Extreme cold | -2.0σ | 0.03% | 0.14% | 0.19 |
| Record cold | -2.5σ | 0.01% | 0.00% | 1.27 |
The frequency ratios in the table above should be interpreted as sampling guidance, not model quality metrics. They indicate how many samples need to be generated to obtain a statistically equivalent number of extreme events compared to the reanalysis dataset.
For example:
- To match the occurrence of heatwave events (ratio ~0.15), generate approximately 7x more samples
- To match cold spell occurrences (ratio ~0.10), generate approximately 10x more samples
This allows users to:
- Generate large ensembles and subsample extreme events for analysis
- Adjust sample sizes based on the return period of specific phenomena of interest
- Use the model for probabilistic climate scenario generation where the full distribution is needed
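The oversampling factors quoted above follow directly from the ratios in the table. A quick helper (hypothetical, not part of the repository) makes the arithmetic explicit:

```python
import math

def required_samples(n_ref, ratio):
    """Samples to generate to expect as many events as n_ref reanalysis
    samples would contain, given the generated/reanalysis frequency ratio."""
    return math.ceil(n_ref / ratio)

print(required_samples(1000, 0.15))  # 6667 -> ~7x more samples for heatwaves
print(required_samples(1000, 0.10))  # 10000 -> ~10x more for cold spells
```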
temperature-weather-generator/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── LICENSE # MIT License
├── .gitignore # Git ignore rules
│
├── configs/ # Configuration files
│ └── config.yaml # Model and training configuration
│
├── models/ # Model architectures
│ ├── __init__.py
│ ├── vae.py # VAE encoder/decoder
│ ├── unet.py # UNet denoiser
│ ├── diffusion.py # Latent diffusion model
│ └── conditioner.py # Climatology conditioning module
│
├── data/ # Data handling
│ ├── __init__.py
│ └── dataset.py # PyTorch datasets
│
├── sampling/ # Sampling algorithms
│ ├── __init__.py
│ └── ddim.py # DDIM sampler
│
├── utils/ # Utilities
│ ├── __init__.py
│ └── config.py # Configuration loading
│
├── scripts/ # Executable scripts
│ ├── prepare_data.py # Data preprocessing
│ ├── train_vae.py # Train VAE/AE model
│ ├── train_diffusion.py # Train diffusion model
│ ├── generate.py # Generate samples
│ └── evaluate.py # Evaluate results
│
├── prepare_data.sh # SLURM script for data preprocessing
├── train_vae.sh # SLURM script for VAE training
├── train_diffusion.sh # SLURM script for diffusion training
├── generate.sh # SLURM script for generation
└── evaluate.sh # SLURM script for evaluation
This work is part of the HMMA project, funded by ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing.
Training was performed on the Leonardo supercomputer, hosted and managed by CINECA.
This work is based on:
- LDCast - A precipitation nowcasting model based on latent diffusion (MeteoSwiss). LDCast uses the same LDM architecture employed by Stable Diffusion. (Paper)
- DiffScaler - A meteorological downscaling model using latent diffusion to downscale ERA5 reanalysis data with COSMO_CLM reference (DSIP-FBK). (GMD Paper)
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 IFAB - International Foundation Big Data and Artificial Intelligence for Human Development and ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing
For questions or collaborations, please open an issue on GitHub.




