
feat(india): Add India solar generation data pipeline #128

Open
Raakshass wants to merge 8 commits into openclimatefix:main from Raakshass:feature/india-solar-pipeline

Conversation

@Raakshass

Pull Request

Description

Add India solar generation data pipeline for PVNet training, enabling solar forecasting for the Indian power grid.

Changes:

  • Add India solar data processing scripts (download, process, test, train)
  • Add PVNet configuration for India solar + GFS NWP
  • Add India regional grid metadata (5 regions: NR, WR, SR, ER, NER)
  • Process Mendeley dataset (Jan 2024 - Jun 2025) to Zarr format
  • Achieve baseline RMSE of 8,270 MW with temporal features

Data source: Mendeley DOI 10.17632/y58jknpgs8.2

Files added:

  • configs/india_pv_data_config.yaml - India solar data settings
  • configs/india_gfs_config.yaml - GFS NWP config for India region
  • configs/india_regions.csv - 5 regional grid metadata
  • configs/PVNet_configs/datamodule/configuration/india_configuration.yaml - Complete PVNet config
  • scripts/download_mendeley_india.py - Dataset download instructions
  • scripts/process_india_data.py - Excel to Zarr conversion
  • scripts/test_india_pipeline.py - Pipeline validation tests
  • scripts/train_india_baseline.py - Solar-only baseline model
  • INDIA_README.md - Contribution documentation

Fixes #121

How Has This Been Tested?

Ran test_india_pipeline.py which validates:

  1. India solar Zarr loading (5,184 hourly rows)
  2. Time alignment verification (Jan 2024 - Jun 2025)
  3. Data integrity checks

Trained baseline model achieving RMSE of 8,270 MW.

  • Yes

Plotted data distribution and verified solar generation patterns match expected diurnal cycles.

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

…mpatibility

- Remove unsupported fields (latitude_center, longitude_center, model)
- Update save_samples.py to use PVNetDataset and spawn method
- Confirmed: OCF GFS data only covers UK, not India
@Raakshass
Author

Request for Guidance: NWP Data Coverage for India

Hi maintainers! I've made good progress on the India solar data pipeline, but I've hit a blocker regarding NWP (GFS) data coverage.

Completed Work

  • Downloaded and processed India solar data from Mendeley dataset (5,184 hourly records)
  • Created india_configuration.yaml with correct generation: schema (not gsp:)
  • Fixed data schema to match ocf-data-sampler requirements (time_utc, location_id, generation_mw, capacity_mwp, longitude, latitude)
  • Validated PVNetDataset creation (142 valid t0 times for test period)
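The schema fix listed above can be illustrated with a small check. This is a minimal sketch, not the code in process_india_data.py: the column names come from the list above, while the Python types, the `validate_record` helper, and the example values are assumptions made for illustration.

```python
from datetime import datetime

# Column names are taken from the schema listed above.
# The expected Python types are assumptions for this sketch.
REQUIRED_COLUMNS = {
    "time_utc": datetime,
    "location_id": int,
    "generation_mw": float,
    "capacity_mwp": float,
    "longitude": float,
    "latitude": float,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems for one record (empty if it is valid)."""
    problems = []
    for name, expected_type in REQUIRED_COLUMNS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


# Hypothetical record; the values are made up and not taken from the dataset.
example = {
    "time_utc": datetime(2024, 1, 1, 6, 0),
    "location_id": 1,
    "generation_mw": 1234.5,
    "capacity_mwp": 20000.0,
    "longitude": 78.0,
    "latitude": 22.0,
}
print(validate_record(example))
```

Running a check like this before writing the Zarr store would surface a missing or mistyped column early, rather than at sample time.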

Blocker: GFS Data Coverage

When attempting to fetch a sample, I get:

ValueError: 78.0 is not in the interval -10.0: 10.0

The GFS data at s3://ocf-open-data-pvnet/data/gfs/v4/2024.zarr appears to cover only the UK region (longitude -10° to 10°, i.e. 10°W to 10°E), while India is at approximately 78°E.
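The failure can be reproduced with a plain interval check. This is a sketch of the idea only; `check_in_bounds` is a hypothetical helper and not the function that raises the error inside ocf-data-sampler. The bounds are the ones quoted in this thread.

```python
def check_in_bounds(value: float, lower: float, upper: float) -> None:
    """Raise ValueError if value lies outside the closed interval [lower, upper]."""
    if not (lower <= value <= upper):
        raise ValueError(f"{value} is not in the interval {lower}: {upper}")


UK_LON_BOUNDS = (-10.0, 10.0)    # longitude range implied by the error message
INDIA_LON_BOUNDS = (67.0, 99.0)  # India bounds used later in this PR
INDIA_CENTRE_LON = 78.0          # approximate longitude quoted above

# Passes: 78.0 lies inside the India subset.
check_in_bounds(INDIA_CENTRE_LON, *INDIA_LON_BOUNDS)

# Fails: the UK-only store cannot serve a location at 78.0 degrees east.
try:
    check_in_bounds(INDIA_CENTRE_LON, *UK_LON_BOUNDS)
except ValueError as err:
    print(err)
```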

Questions for Maintainers

  1. Is there existing global GFS data in the OCF bucket that I might have missed?
  2. Should I process NOAA's global GFS data (from s3://noaa-gfs-bdp-pds/) for India coverage?
  3. Would a solar-only baseline (no NWP) be acceptable as a first contribution, with NWP integration as a follow-up?

Any guidance would be greatly appreciated!

@Raakshass Raakshass mentioned this pull request Feb 9, 2026
@Raakshass
Author

Hi @peterdudfield — just following up on my question above about GFS data coverage for India. The existing GFS bucket only covers UK (-10° to 10°E). Would you prefer:

  1. Solar-only contribution first (no NWP), or
  2. I process NOAA global GFS for India?

Happy to split this PR into smaller pieces if that helps with review. Thanks!

@peterdudfield
Contributor

  1. getting the data for India would be great

- Add download_gfs_india.py using Herbie byte-range downloads
  - Downloads only specific variables (~2-5MB each vs 300MB full GRIB)
  - Maps all 14 OCF channels to GFS GRIB search terms
  - Subsets GRIB data to India bounds (5-39N, 67-99E)
  - Converts to OCF-compatible Zarr format
  - Supports monthly/yearly processing with merge
  - Includes validation against OCF schema
- Update gfs.py: implement process_gfs_data() (was NotImplementedError)

OCF channel mapping:
  dlwrf->DLWRF, dswrf->DSWRF, hcc->HCDC, lcc->LCDC, mcc->MCDC,
  prate->PRATE, r->RH:850mb, t->TMP:2m, tcc->TCDC:entire,
  u10/v10->UGRD/VGRD:10m, u100/v100->UGRD/VGRD:100m, vis->VIS

Verified: 3 forecast steps processed successfully (12-14/14 channels).
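The channel mapping in the commit message above could be held in a lookup table like the one below. This is a sketch: the keys and values copy the shorthand from the commit message and have not been checked against the actual search strings in download_gfs_india.py, and `missing_channels` is a hypothetical helper for reporting partial extractions such as 12/14.

```python
# OCF channel name -> GFS GRIB shorthand, copied from the commit message above.
# These are NOT verified Herbie search strings, only the abbreviations as written.
OCF_TO_GFS = {
    "dlwrf": "DLWRF",
    "dswrf": "DSWRF",
    "hcc": "HCDC",
    "lcc": "LCDC",
    "mcc": "MCDC",
    "prate": "PRATE",
    "r": "RH:850mb",
    "t": "TMP:2m",
    "tcc": "TCDC:entire",
    "u10": "UGRD:10m",
    "v10": "VGRD:10m",
    "u100": "UGRD:100m",
    "v100": "VGRD:100m",
    "vis": "VIS",
}


def missing_channels(extracted: set[str]) -> list[str]:
    """Return the OCF channels that were not extracted, in alphabetical order."""
    return sorted(set(OCF_TO_GFS) - extracted)


# Hypothetical forecast step where two channels could not be extracted.
extracted = set(OCF_TO_GFS) - {"vis", "u100"}
print(f"{len(extracted)}/{len(OCF_TO_GFS)} channels, missing: {missing_channels(extracted)}")
```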
- NOMADS mode: subregion downloads (~33KB vs 300MB full GRIB)
  - Selects India (5-39N, 67-99E) + 14 OCF channels server-side
  - ~8500x reduction in download size per request
- Herbie mode: S3 byte-range for historical data (fallback)
- ThreadPoolExecutor with configurable workers (default 6)
- Both modes verified: NOMADS returns (137,129) India grid,
  Herbie extracts 14/14 channels at forecast steps

Designed to run on cloud infrastructure or Google Colab for
production-scale processing. Local testing verified with both modes.
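The NOMADS mode described above can be sketched as follows. The URL layout reflects the public NOMADS GRIB filter for 0.25° GFS as I understand it and is an assumption, not a copy of download_gfs_india.py. `build_filter_url`, `download_all`, and the stub `fake_fetch` are hypothetical names, and only one variable (TMP at 2 m) is requested to keep the example short. No network request is made.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Assumed endpoint of the NOMADS GRIB filter for 0.25 degree GFS.
NOMADS_FILTER = "https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25.pl"


def build_filter_url(date: str, cycle: int, step: int) -> str:
    """Build a subregion request for India (5-39N, 67-99E) for one forecast step."""
    params = {
        "dir": f"/gfs.{date}/{cycle:02d}/atmos",
        "file": f"gfs.t{cycle:02d}z.pgrb2.0p25.f{step:03d}",
        "var_TMP": "on",               # one example variable
        "lev_2_m_above_ground": "on",  # one example level
        "subregion": "",
        "toplat": 39,
        "bottomlat": 5,
        "leftlon": 67,
        "rightlon": 99,
    }
    return f"{NOMADS_FILTER}?{urlencode(params)}"


def download_all(urls, fetch, max_workers=6):
    """Apply fetch to every URL using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url: str) -> int:
    """Stand-in for a real HTTP download; returns the URL length instead of bytes."""
    return len(url)


urls = [build_filter_url("20240101", 0, step) for step in (0, 3, 6)]
print(download_all(urls, fake_fetch))
```

Passing `fetch` as an argument keeps the sketch testable offline; a real run would swap in a function that performs the HTTP request and writes the GRIB bytes to disk.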
@codecov

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 40.68966% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.49%. Comparing base (12d4558) to head (6b0e4b7).
⚠️ Report is 24 commits behind head on main.

Files with missing lines | Patch % | Lines
src/open_data_pvnet/scripts/test_india_pipeline.py | 42.06% | 73 Missing ⚠️
src/open_data_pvnet/nwp/gfs.py | 31.57% | 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #128      +/-   ##
==========================================
+ Coverage   45.72%   53.49%   +7.76%     
==========================================
  Files          16       13       -3     
  Lines        1124     1159      +35     
==========================================
+ Hits          514      620     +106     
+ Misses        610      539      -71     

- tests/nwp/test_gfs.py: 8 unit tests covering all process_gfs_data
  branches (valid/invalid regions, custom output_dir, max_days,
  RuntimeError on None result) using mocked process_month
- test_india_pipeline.py: rewritten as proper pytest tests with
  mocked datasets (no hardcoded local paths, no S3 access needed)
- Fixes codecov/patch coverage failure (was 40.68%, target 45.72%)
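The mocking approach named in this commit message can be sketched as below. Everything here is hypothetical: `process_gfs_data_sketch` is a stand-in written for this example, and its signature and region list are assumptions, not the real process_gfs_data in gfs.py. The point is only to show how a mocked process_month lets the error branches be exercised without downloads.

```python
from unittest.mock import Mock

VALID_REGIONS = {"india"}  # hypothetical; the real list of regions is not shown in this PR


def process_gfs_data_sketch(year, month, region, process_month):
    """Hypothetical stand-in: validate the region, delegate, and reject a None result."""
    if region not in VALID_REGIONS:
        raise ValueError(f"unsupported region: {region}")
    result = process_month(year, month)
    if result is None:
        raise RuntimeError(f"no data produced for {year}-{month:02d}")
    return result


def test_runtime_error_on_none_result():
    fake_process_month = Mock(return_value=None)
    try:
        process_gfs_data_sketch(2024, 1, "india", fake_process_month)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    fake_process_month.assert_called_once_with(2024, 1)


def test_invalid_region_skips_processing():
    fake_process_month = Mock()
    try:
        process_gfs_data_sketch(2024, 1, "atlantis", fake_process_month)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    fake_process_month.assert_not_called()


test_runtime_error_on_none_result()
test_invalid_region_skips_processing()
```

Under pytest the try/except blocks would normally be written with `pytest.raises`; plain asserts are used here so the sketch runs with the standard library alone.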
@Raakshass
Author

@peterdudfield Done — I've added the GFS download pipeline for India in the last 2 commits.

How it works:

  • Downloads from NOAA GFS via NOMADS GRIB filter (subregion-specific, ~33KB vs 300MB per file)
  • Converts to OCF-compatible Zarr with all 14 channels (dlwrf, dswrf, hcc, etc.)
  • India bounds: 5-39°N, 67-99°E at 0.25° resolution
  • Fallback to Herbie S3 byte-range downloads for historical data

Verified: NOMADS returns correct India grid (137×129), Herbie extracts 14/14 channels at forecast steps.
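The 137×129 grid shape follows directly from the stated bounds and resolution, assuming both edges of each range are included. A quick check, with `grid_points` as a hypothetical helper:

```python
def grid_points(lower: float, upper: float, resolution: float) -> int:
    """Number of grid points from lower to upper inclusive at the given spacing."""
    return round((upper - lower) / resolution) + 1


n_lat = grid_points(5, 39, 0.25)   # latitude 5-39N at 0.25 degrees
n_lon = grid_points(67, 99, 0.25)  # longitude 67-99E at 0.25 degrees
print(n_lat, n_lon)  # 137 129
```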

Blocker: Full-scale processing (months of data) needs cloud compute — NOMADS covers ~10 days, Herbie takes hours per day locally. Could OCF run this on your infra, or should I set up a Colab notebook?

Also resolved the merge conflict and added proper unit tests — all CI checks passing now.

@Her0n24

Her0n24 commented Feb 20, 2026

Hi @Raakshass, I've incorporated your GFS download pipeline (download_gfs_india.py) into my next commit/PR for training PVNet over France. It is good work, and since the process is the same for me, I didn't have to write duplicate code for essentially the same workflow.

I have the same question about whether training the model locally is the best idea, since downloading the GFS data took around 1 hour for 1 month. The .zarr for each month with all the variables is ~30 MB, so storage wouldn't be a problem. However, as you've said, the data will span years at 0.5/1 hr resolution with 14 channels.

I will wait for advice here before proceeding.

Development

Successfully merging this pull request may close these issues.

Country selection and coordination for PVNet training

3 participants