feat(india): Add India solar generation data pipeline by Raakshass · Pull Request #128 · openclimatefix/open-data-pvnet

Raakshass · 2026-02-08T18:44:14Z

Pull Request

Description

Add India solar generation data pipeline for PVNet training, enabling solar forecasting for the Indian power grid.

Changes:

Add India solar data processing scripts (download, process, test, train)
Add PVNet configuration for India solar + GFS NWP
Add India regional grid metadata (5 regions: NR, WR, SR, ER, NER)
Process Mendeley dataset (Jan 2024 - Jun 2025) to Zarr format
Achieve baseline RMSE of 8,270 MW with temporal features

Data source: Mendeley DOI 10.17632/y58jknpgs8.2

Files added:

configs/india_pv_data_config.yaml - India solar data settings
configs/india_gfs_config.yaml - GFS NWP config for India region
configs/india_regions.csv - 5 regional grid metadata
configs/PVNet_configs/datamodule/configuration/india_configuration.yaml - Complete PVNet config
scripts/download_mendeley_india.py - Dataset download instructions
scripts/process_india_data.py - Excel to Zarr conversion
scripts/test_india_pipeline.py - Pipeline validation tests
scripts/train_india_baseline.py - Solar-only baseline model
INDIA_README.md - Contribution documentation

Fixes #121

How Has This Been Tested?

Ran test_india_pipeline.py which validates:

India solar Zarr loading (5,184 hourly rows)
Time alignment verification (Jan 2024 - Jun 2025)
Data integrity checks

Trained baseline model achieving RMSE of 8,270 MW.
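
For reference, the RMSE figure quoted here is the square root of the mean squared difference between forecast and observed generation. A minimal sketch in plain Python follows. The function name `rmse` and the sample values are illustrative assumptions and are not taken from train_india_baseline.py.

```python
import math


def rmse(predicted_mw, actual_mw):
    """Root-mean-square error between two equal-length sequences, in MW."""
    if len(predicted_mw) != len(actual_mw) or len(actual_mw) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted_mw, actual_mw)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, only to show the call shape.
example_error = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

Because the metric is in MW, it is easiest to interpret alongside the installed capacity of the region being forecast.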

Yes

Plotted data distribution and verified solar generation patterns match expected diurnal cycles.

Yes

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

…mpatibility
- Remove unsupported fields (latitude_center, longitude_center, model)
- Update save_samples.py to use PVNetDataset and spawn method
- Confirmed: OCF GFS data only covers UK, not India

Raakshass · 2026-02-09T12:46:38Z

Request for Guidance: NWP Data Coverage for India

Hi maintainers! I've made good progress on the India solar data pipeline, but I've hit a blocker regarding NWP (GFS) data coverage.

Completed Work

Downloaded and processed India solar data from Mendeley dataset (5,184 hourly records)
Created india_configuration.yaml with correct generation: schema (not gsp:)
Fixed data schema to match ocf-data-sampler requirements (time_utc, location_id, generation_mw, capacity_mwp, longitude, latitude)
Validated PVNetDataset creation (142 valid t0 times for test period)
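
The schema fix listed above can be checked with a small field-presence test. This is a sketch only: the field names come from the comment, while the helper `missing_fields` and the sample record are hypothetical and not part of ocf-data-sampler.

```python
# Field names as listed in the comment above.
REQUIRED_FIELDS = (
    "time_utc",
    "location_id",
    "generation_mw",
    "capacity_mwp",
    "longitude",
    "latitude",
)


def missing_fields(record):
    """Return the required field names that are absent from a record (a dict)."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Illustrative record with made-up values.
sample = {
    "time_utc": "2024-01-01T06:00:00Z",
    "location_id": 1,
    "generation_mw": 1234.5,
    "capacity_mwp": 50000.0,
    "longitude": 78.0,
    "latitude": 22.0,
}
print(missing_fields(sample))
```

Running a check like this before building the dataset surfaces a renamed or dropped column early, rather than deep inside sample creation.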

Blocker: GFS Data Coverage

When attempting to fetch a sample, I get:

ValueError: 78.0 is not in the interval -10.0: 10.0

The GFS data at s3://ocf-open-data-pvnet/data/gfs/v4/2024.zarr appears to only cover the UK region (-10° to 10°E longitude), while India is at approximately 78°E.
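
A pre-flight check along these lines would catch the coverage mismatch before any sample is requested. It is a sketch under assumptions: the function `check_in_interval` is hypothetical, the bounds are the ones quoted in the error message, and this is not how ocf-data-sampler performs its own check.

```python
def check_in_interval(value, lower, upper, name="longitude"):
    """Raise ValueError if value lies outside the closed interval [lower, upper]."""
    if not (lower <= value <= upper):
        raise ValueError(
            f"{name} {value} is outside the covered interval [{lower}, {upper}]"
        )
    return value


# India's approximate longitude against the UK-only coverage reported above.
try:
    check_in_interval(78.0, -10.0, 10.0)
    covered = True
except ValueError as err:
    covered = False
    print(err)
```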

Questions for Maintainers

Is there existing global GFS data in the OCF bucket that I might have missed?
Should I process NOAA's global GFS data (from s3://noaa-gfs-bdp-pds/) for India coverage?
Would a solar-only baseline (no NWP) be acceptable as a first contribution, with NWP integration as a follow-up?

Any guidance would be greatly appreciated!

Raakshass · 2026-02-14T08:03:14Z

Hi @peterdudfield — just following up on my question above about GFS data coverage for India. The existing GFS bucket only covers UK (-10° to 10°E). Would you prefer:

Solar-only contribution first (no NWP), or
I process NOAA global GFS for India?

Happy to split this PR into smaller pieces if that helps with review. Thanks!

peterdudfield · 2026-02-16T12:52:16Z

getting the data for India would be great

- Add download_gfs_india.py using Herbie byte-range downloads
- Downloads only specific variables (~2-5MB each vs 300MB full GRIB)
- Maps all 14 OCF channels to GFS GRIB search terms
- Subsets GRIB data to India bounds (5-39N, 67-99E)
- Converts to OCF-compatible Zarr format
- Supports monthly/yearly processing with merge
- Includes validation against OCF schema
- Update gfs.py: implement process_gfs_data() (was NotImplementedError)

OCF channel mapping: dlwrf->DLWRF, dswrf->DSWRF, hcc->HCDC, lcc->LCDC, mcc->MCDC, prate->PRATE, r->RH:850mb, t->TMP:2m, tcc->TCDC:entire, u10/v10->UGRD/VGRD:10m, u100/v100->UGRD/VGRD:100m, vis->VIS

Verified: 3 forecast steps processed successfully (12-14/14 channels).
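
The channel mapping in this commit message can be written as a lookup table. The sketch below copies the pairs exactly as listed. The dictionary name `OCF_TO_GFS` and the helper `missing_channels` are hypothetical, and the search strings actually passed to Herbie in download_gfs_india.py may be spelled differently.

```python
# OCF channel name -> GFS GRIB term, as written in the commit message.
OCF_TO_GFS = {
    "dlwrf": "DLWRF",
    "dswrf": "DSWRF",
    "hcc": "HCDC",
    "lcc": "LCDC",
    "mcc": "MCDC",
    "prate": "PRATE",
    "r": "RH:850mb",
    "t": "TMP:2m",
    "tcc": "TCDC:entire",
    "u10": "UGRD:10m",
    "v10": "VGRD:10m",
    "u100": "UGRD:100m",
    "v100": "VGRD:100m",
    "vis": "VIS",
}


def missing_channels(found):
    """Return the OCF channels absent from an iterable of extracted channel names."""
    return sorted(set(OCF_TO_GFS) - set(found))


print(len(OCF_TO_GFS))
```

A helper like `missing_channels` makes it straightforward to report which channels are absent at forecast steps where fewer than 14 are extracted.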

- NOMADS mode: subregion downloads (~33KB vs 300MB full GRIB)
- Selects India (5-39N, 67-99E) + 14 OCF channels server-side
- ~8500x reduction in download size per request
- Herbie mode: S3 byte-range for historical data (fallback)
- ThreadPoolExecutor with configurable workers (default 6)
- Both modes verified: NOMADS returns (137,129) India grid, Herbie extracts 14/14 channels at forecast steps

Designed to run on cloud infrastructure or Google Colab for production-scale processing. Local testing verified with both modes.
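
A server-side subregion request of the kind described in this commit message might be assembled as below. This is a sketch under assumptions: the endpoint and query parameter names follow the public NOMADS GRIB filter form for 0.25° GFS as I understand it, the function `build_filter_url` is hypothetical, and none of this is confirmed to match download_gfs_india.py. The code only builds a URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 0.25 degree GFS GRIB filter.
NOMADS_FILTER = "https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25.pl"


def build_filter_url(date, cycle, step, variables, levels):
    """Build a GRIB filter URL for the India box (5-39N, 67-99E).

    date is YYYYMMDD, cycle is the model run hour, step is the forecast hour.
    """
    params = [
        ("dir", f"/gfs.{date}/{cycle:02d}/atmos"),
        ("file", f"gfs.t{cycle:02d}z.pgrb2.0p25.f{step:03d}"),
    ]
    params += [(f"var_{name}", "on") for name in variables]
    params += [(f"lev_{name}", "on") for name in levels]
    params += [
        ("subregion", ""),
        ("toplat", 39),
        ("leftlon", 67),
        ("rightlon", 99),
        ("bottomlat", 5),
    ]
    return f"{NOMADS_FILTER}?{urlencode(params)}"


url = build_filter_url("20240101", 0, 3, ["TMP"], ["2_m_above_ground"])
print(url)
```

Selecting variables, levels, and the bounding box in the query is what keeps each response small compared with a full GRIB file.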

codecov · 2026-02-16T18:27:17Z

Codecov Report

❌ Patch coverage is 40.68966% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.49%. Comparing base (12d4558) to head (6b0e4b7).
⚠️ Report is 24 commits behind head on main.

Files with missing lines	Patch %	Lines
src/open_data_pvnet/scripts/test_india_pipeline.py	42.06%	73 Missing ⚠️
src/open_data_pvnet/nwp/gfs.py	31.57%	13 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #128      +/-   ##
==========================================
+ Coverage   45.72%   53.49%   +7.76%     
==========================================
  Files          16       13       -3     
  Lines        1124     1159      +35     
==========================================
+ Hits          514      620     +106     
+ Misses        610      539      -71

- tests/nwp/test_gfs.py: 8 unit tests covering all process_gfs_data branches (valid/invalid regions, custom output_dir, max_days, RuntimeError on None result) using mocked process_month
- test_india_pipeline.py: rewritten as proper pytest tests with mocked datasets (no hardcoded local paths, no S3 access needed)
- Fixes codecov/patch coverage failure (was 40.68%, target 45.72%)

Raakshass · 2026-02-16T18:49:11Z

@peterdudfield Done — I've added the GFS download pipeline for India in the last 2 commits.

How it works:

Downloads from NOAA GFS via NOMADS GRIB filter (subregion-specific, ~33KB vs 300MB per file)
Converts to OCF-compatible Zarr with all 14 channels (dlwrf, dswrf, hcc, etc.)
India bounds: 5-39°N, 67-99°E at 0.25° resolution
Fallback to Herbie S3 byte-range downloads for historical data

Verified: NOMADS returns correct India grid (137×129), Herbie extracts 14/14 channels at forecast steps.
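
The 137×129 grid follows directly from the stated bounds and resolution, counting both end points of each axis. A quick check, with `grid_points` as a hypothetical helper name:

```python
def grid_points(lower_deg, upper_deg, resolution_deg):
    """Number of grid points on a regular axis that includes both end points."""
    return round((upper_deg - lower_deg) / resolution_deg) + 1


n_lat = grid_points(5, 39, 0.25)   # latitude axis, 5-39°N
n_lon = grid_points(67, 99, 0.25)  # longitude axis, 67-99°E
print(n_lat, n_lon)  # 137 129
```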

Blocker: Full-scale processing (months of data) needs cloud compute — NOMADS covers ~10 days, Herbie takes hours per day locally. Could OCF run this on your infra, or should I set up a Colab notebook?

Also resolved the merge conflict and added proper unit tests — all CI checks passing now.

Her0n24 · 2026-02-20T19:18:18Z

Hi @Raakshass, I've incorporated your GFS download pipeline, download_gfs_india.py, into my next commit/PR for training PVNet over France. It is good work, and since the process is the same for me, I didn't have to write duplicate code for essentially the same workflow.

I have the same question about whether training the model locally is the best idea, since downloading the GFS data took around 1 hour for 1 month. The .zarr for each month with all the variables is ~30MB, so storage wouldn't be a problem. However, as you've said, the data will span years at 0.5/1 hr resolution with 14 channels.

I will wait for advice here before proceeding.

Raakshass mentioned this pull request Feb 8, 2026

Country selection and coordination for PVNet training #121

Open

Raakshass added 3 commits February 9, 2026 02:04

fix(india): Use generation schema compatible with ocf-data-sampler

c20be7c

test: Update validation script for new generation schema

a2f0ae6

fix: Update config schema and save_samples.py for ocf-data-sampler co…

3110bd1

Raakshass mentioned this pull request Feb 9, 2026

Code Coverage #120

Open

Raakshass added 3 commits February 16, 2026 22:46

merge: sync with upstream main (file restructuring, EIA data, docs)

6b0e4b7

Raakshass wants to merge 8 commits into openclimatefix:main from Raakshass:feature/india-solar-pipeline

3 participants
