
Temperature Weather Generator

License: MIT

A conditional latent diffusion model for generating realistic 24-hour temperature sequences over northern Italy. This project implements a state-of-the-art deep learning pipeline combining Variational Autoencoders (VAE) with diffusion models to generate physically consistent temperature maps.

This work is part of the HMMA project, funded by ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing.

Example: Real vs Generated Sequences


Comparison of 24-hour temperature sequences across all months. First row: real reanalysis data. Following rows: model-generated sequences. Each column represents a different month.


Key Features

  • Conditional Generation: Month-based conditioning for seasonal variation (12 classes)
  • Latent Diffusion: Generation in compressed latent space
  • DDIM Sampling: Fast sampling (100 steps) with deterministic/stochastic options
  • Physically-Based Evaluation: Metrics for extreme events, spatial extent, temporal persistence
  • HPC Training: Trained on Leonardo supercomputer using data parallelism (2 nodes × 4 GPUs per node)
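As a rough illustration of the DDIM sampling mentioned above, here is a generic, NumPy-only sketch of one deterministic update step (η = 0). This is an assumption-laden toy, not the repository's `sampling/ddim.py` implementation:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0): estimate the clean latent
    from the noisy latent x_t, then re-noise it to the previous timestep."""
    # Predicted clean latent from the current noisy latent
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move to the previous (less noisy) timestep along the deterministic path
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1 - alpha_bar_prev) * eps_pred

# Toy check: with a perfect noise prediction, stepping to alpha_bar_prev = 1
# recovers the clean latent exactly.
x0 = np.ones((2, 2))
eps = np.full((2, 2), 0.5)
alpha_bar_t = 0.9
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
assert np.allclose(x_prev, x0)
```

Because the update is deterministic, far fewer steps (here, 100) suffice compared to ancestral DDPM sampling; the stochastic option adds noise scaled by η at each step.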

Training Details

Both the VAE and diffusion models were trained on the Leonardo supercomputer (hosted and managed by CINECA) using:

  • Hardware: 2 nodes with 4 GPUs each (8 GPUs total)
  • Strategy: Data parallelism with PyTorch Lightning DDP
  • Precision: Mixed precision (16-bit) training
  • Data: High-resolution temperature data from VHR-REA IT (COSMO_CLM dynamical downscaling of ERA5) over northern Italy (1981-2018 training, 2019 validation)
  • Grid: 128 × 256 spatial resolution (2.2 km grid size), 24-hour temporal sequences
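The distributed setup above roughly corresponds to a PyTorch Lightning Trainer configuration like the following sketch (Lightning 2.x flag names; the exact arguments used in the repository's training scripts may differ):

```python
# Configuration sketch only -- mirrors the hardware/strategy/precision bullets above.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,             # 4 GPUs per node
    num_nodes=2,           # 2 nodes -> 8 GPUs total
    strategy="ddp",        # data parallelism via DistributedDataParallel
    precision="16-mixed",  # mixed-precision (16-bit) training
)
```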

Installation

Pip Installation

# Clone repository
git clone https://github.com/yourusername/temperature-weather-generator.git
cd temperature-weather-generator

# Create virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements.txt

Conda Installation

# Clone repository
git clone https://github.com/yourusername/temperature-weather-generator.git
cd temperature-weather-generator

# Create conda environment
conda create -n tempgen python=3.9
conda activate tempgen

# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install dependencies
pip install -r requirements.txt

Data and Pretrained Models

All data and pretrained models are available for download:

Download Data and Models

Download Folder Structure

After downloading from SharePoint, you'll have:

data/
├── data_anomaly_preprocessed_percentile/  # Preprocessed anomaly data (ready for training)
├── nc/                                     # Generated sequences in NetCDF format
├── npy/                                    # Generated sequences in NumPy format
├── VH_REA_1981_2020_cropER/               # Raw VHR-REA IT data (cropped)
└── *.npy                                   # Climatology statistics files

checkpoints/
├── generator/                              # Diffusion model checkpoints
│   ├── last.ckpt                          # Last training checkpoint
│   └── best.ckpt                          # Best validation loss checkpoint
└── vae/                                    # VAE model checkpoints
    ├── last.ckpt                          # Last training checkpoint
    └── best.ckpt                          # Best validation loss checkpoint

Place these folders in the repository root directory.

Dataset Information

The dataset is derived from VHR-REA IT, a high-resolution (2.2-km) reanalysis produced by dynamical downscaling of ERA5 using COSMO_CLM over Italy.

The raw VHR-REA IT data is available at: CMCC Data Delivery System

The dataset includes:

  • Preprocessed 2-meter temperature anomalies (1981-2020)
  • Climatology statistics (12 months × 24 hours)
  • Training data: 1981-2018
  • Validation data: 2019
  • Test data: 2020
  • Grid: 128 × 256 spatial resolution (2.2 km grid size) over northern Italy
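To make the climatology/anomaly relationship concrete, here is a simplified standardized-anomaly sketch. The array shapes and names are hypothetical, and the repository's percentile-based preprocessing (see `data_anomaly_preprocessed_percentile/`) may differ in detail:

```python
import numpy as np

# Hypothetical climatology arrays: mean/std per month, hour, and grid cell,
# standing in for the climatology statistics files (12 months x 24 hours).
H, W = 128, 256
clim_mean = np.zeros((12, 24, H, W))
clim_std = np.ones((12, 24, H, W))

def to_anomaly(t2m_seq, month):
    """Convert a 24-hour 2-meter temperature sequence (24, H, W) for a given
    month (0-11) into standardized anomalies using the climatology."""
    return (t2m_seq - clim_mean[month]) / clim_std[month]

seq = np.full((24, H, W), 2.5)   # toy constant sequence
anom = to_anomaly(seq, month=6)
assert anom.shape == (24, H, W)
```

Working in anomaly space removes the strong seasonal and diurnal cycles, so the model only has to learn the residual variability.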

Pretrained Models

  • VAE checkpoint: Trained autoencoder for latent space compression (16× spatial compression)
  • Diffusion checkpoint: Conditional diffusion model with month conditioning (12 classes)

Quick Start

1. Data Preprocessing (if using raw data)

# Edit base_path in scripts/prepare_data.py (line ~771)
python scripts/prepare_data.py

2. Train VAE

# Configure paths in configs/config.yaml
python scripts/train_vae.py

3. Train Conditional Diffusion Model

# Requires trained VAE checkpoint
python scripts/train_diffusion.py

4. Generate Samples

# Update checkpoint paths in scripts/generate.py (lines ~992-994)
python scripts/generate.py

5. Evaluate Results

# Update paths in scripts/evaluate.py (lines ~1021-1024)
python scripts/evaluate.py

Running on Leonardo (CINECA HPC)

SLURM batch scripts are provided for running on the Leonardo supercomputer at CINECA. These scripts are configured for multi-node, multi-GPU training.

Available SLURM Scripts

Script               Purpose                  Resources
prepare_data.sh      Data preprocessing       1 node, CPU only
train_vae.sh         Train VAE model          2 nodes, 4 GPUs each
train_diffusion.sh   Train diffusion model    2 nodes, 4 GPUs each
generate.sh          Generate samples         1 node, 1 GPU
evaluate.sh          Evaluate results         1 node, CPU only

Setup on Leonardo

# 1. Load required modules
module load profile/deeplrn
module load cineca-ai/4.3.0

# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install additional dependencies (if needed)
pip install -r requirements.txt

Submitting Jobs

# Data preprocessing
sbatch prepare_data.sh

# Train VAE (after data is ready)
sbatch train_vae.sh

# Train diffusion model (after VAE is trained)
sbatch train_diffusion.sh

# Generate samples (after diffusion model is trained)
sbatch generate.sh

# Evaluate results
sbatch evaluate.sh

Customization

Before submitting jobs, update the following in each .sh file:

  • --account: Your CINECA account code
  • --time: Adjust time limit based on your needs
  • --qos: Quality of service (use boost_qos_dbg for testing, normal for production)
  • Virtual environment path if different from venv/

Model Performance

Evaluation Summary

The model was evaluated on 1024 generated samples compared against reanalysis data. The evaluation focuses on phenomenon-based metrics that assess the physical realism of generated temperature anomalies.

Key Finding: The model uses weak conditioning (month-based categorical labels only) but successfully reproduces the overall data distribution, with generated values slightly biased towards the mean. This is expected behavior for diffusion models and does not indicate poor quality; rather, it reflects the probabilistic nature of the generation process.

Distribution Comparison

The generated distribution closely matches the reanalysis distribution in the central region (within ±1σ), with reduced variance in the tails:

Distribution Comparison

Quantile-Quantile Analysis

The Q-Q plot shows that generated quantiles fall within the ±0.3σ tolerance band across most of the distribution, with slight compression at the extremes:

Q-Q Plot

Percentile   Generated (σ)   Reanalysis (σ)   Difference
0.1%         -1.63           -2.05            +0.42
1.0%         -1.07           -1.62            +0.55
50.0%        +0.08           +0.03            +0.05
99.0%        +1.17           +1.54            -0.37
99.9%        +1.61           +2.00            -0.39
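A quantile comparison like the table above can be reproduced with `np.percentile`. The sketch below uses synthetic Gaussian stand-ins (a variance-compressed "generated" sample vs. a unit-variance "reanalysis" sample) purely to illustrate how tail compression shows up in the quantiles; it does not use the actual model output:

```python
import numpy as np

rng = np.random.default_rng(0)
generated = rng.normal(0, 0.8, 100_000)    # stand-in: variance-compressed samples
reanalysis = rng.normal(0, 1.0, 100_000)   # stand-in: reference distribution

percentiles = [0.1, 1.0, 50.0, 99.0, 99.9]
q_gen = np.percentile(generated, percentiles)
q_rea = np.percentile(reanalysis, percentiles)

# Tail compression appears as smaller |quantiles| for the generated samples
assert abs(q_gen[0]) < abs(q_rea[0]) and q_gen[-1] < q_rea[-1]
```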

Extreme Event Frequencies

Exceedance Frequencies

The model generates fewer extreme events compared to reanalysis, which is consistent with the slight mean-regression behavior:

Event Type     Threshold   Generated   Reanalysis   Ratio
Mild warm      +0.5σ       17.43%      25.46%       0.68
Warm anomaly   +1.0σ        2.29%       7.69%       0.30
Heatwave       +1.5σ        0.18%       1.18%       0.15
Extreme heat   +2.0σ        0.01%       0.10%       0.14
Record heat    +2.5σ        0.00%       0.01%       0.25
Cold anomaly   -1.0σ        1.35%       8.26%       0.16
Cold spell     -1.5σ        0.17%       1.59%       0.10
Extreme cold   -2.0σ        0.03%       0.14%       0.19
Record cold    -2.5σ        0.01%       0.00%       1.27

Spatial Extent of Extreme Events

Spatial Extents

Interpretation and Usage Recommendations

The frequency ratios in the table above should be interpreted as sampling guidance, not model quality metrics. They indicate how many samples need to be generated to obtain a statistically equivalent number of extreme events compared to the reanalysis dataset.

For example:

  • To match the occurrence of heatwave events (ratio ~0.15), generate approximately 7x more samples
  • To match cold spell occurrences (ratio ~0.10), generate approximately 10x more samples

This allows users to:

  1. Generate large ensembles and subsample extreme events for analysis
  2. Adjust sample sizes based on the return period of specific phenomena of interest
  3. Use the model for probabilistic climate scenario generation where the full distribution is needed
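The sampling guidance above reduces to inverting the frequency ratio. A minimal helper (the function name is ours, not from the repository):

```python
def oversampling_factor(ratio):
    """Approximate multiplier on the number of generated samples needed so
    the expected count of extreme events matches reanalysis, where ratio is
    the generated/reanalysis exceedance frequency from the table above."""
    return round(1.0 / ratio)

assert oversampling_factor(0.15) == 7    # heatwaves: ~7x more samples
assert oversampling_factor(0.10) == 10   # cold spells: ~10x more samples
```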

Project Structure

temperature-weather-generator/
├── README.md                      # Project documentation
├── requirements.txt               # Python dependencies
├── LICENSE                        # MIT License
├── .gitignore                     # Git ignore rules
│
├── configs/                       # Configuration files
│   └── config.yaml                # Model and training configuration
│
├── models/                        # Model architectures
│   ├── __init__.py
│   ├── vae.py                     # VAE encoder/decoder
│   ├── unet.py                    # UNet denoiser
│   ├── diffusion.py               # Latent diffusion model
│   └── conditioner.py             # Climatology conditioning module
│
├── data/                          # Data handling
│   ├── __init__.py
│   └── dataset.py                 # PyTorch datasets
│
├── sampling/                      # Sampling algorithms
│   ├── __init__.py
│   └── ddim.py                    # DDIM sampler
│
├── utils/                         # Utilities
│   ├── __init__.py
│   └── config.py                  # Configuration loading
│
├── scripts/                       # Executable scripts
│   ├── prepare_data.py            # Data preprocessing
│   ├── train_vae.py               # Train VAE/AE model
│   ├── train_diffusion.py         # Train diffusion model
│   ├── generate.py                # Generate samples
│   └── evaluate.py                # Evaluate results
│
├── prepare_data.sh                # SLURM script for data preprocessing
├── train_vae.sh                   # SLURM script for VAE training
├── train_diffusion.sh             # SLURM script for diffusion training
├── generate.sh                    # SLURM script for generation
└── evaluate.sh                    # SLURM script for evaluation

Acknowledgments

This work is part of the HMMA project, funded by ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing.

Training was performed on the Leonardo supercomputer, hosted and managed by CINECA.

Credits

This work is based on:

  • LDCast - A precipitation nowcasting model based on latent diffusion (MeteoSwiss). LDCast uses the same LDM architecture employed by Stable Diffusion. (Paper)
  • DiffScaler - A meteorological downscaling model using latent diffusion to downscale ERA5 reanalysis data with COSMO_CLM reference (DSIP-FBK). (GMD Paper)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2025 IFAB - International Foundation Big Data and Artificial Intelligence for Human Development and ICSC - Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing


Contact

For questions or collaborations, please open an issue on GitHub.
