feat(india): Add India solar generation data pipeline #128
Raakshass wants to merge 8 commits into openclimatefix:main from
- Add India solar data processing scripts (download, process, test, train)
- Add PVNet configuration for India solar + GFS NWP
- Add India regional grid metadata (5 regions: NR, WR, SR, ER, NER)
- Process Mendeley dataset (Jan 2024 - Jun 2025) to Zarr format
- Achieve baseline RMSE of 8,270 MW with temporal features

Data source: Mendeley DOI 10.17632/y58jknpgs8.2

Closes openclimatefix#121
…mpatibility

- Remove unsupported fields (latitude_center, longitude_center, model)
- Update save_samples.py to use PVNetDataset and spawn method
- Confirmed: OCF GFS data only covers UK, not India
Request for Guidance: NWP Data Coverage for India

Hi maintainers! I've made good progress on the India solar data pipeline, but I've hit a blocker regarding NWP (GFS) data coverage.

Completed Work
Blocker: GFS Data Coverage

When attempting to fetch a sample, I get:

The GFS data at

Questions for Maintainers
Any guidance would be greatly appreciated!
Hi @peterdudfield, just following up on my question above about GFS data coverage for India. The existing GFS bucket only covers the UK (-10° to 10°E). Would you prefer:
Happy to split this PR into smaller pieces if that helps with review. Thanks!
- Add download_gfs_india.py using Herbie byte-range downloads
- Downloads only specific variables (~2-5MB each vs 300MB full GRIB)
- Maps all 14 OCF channels to GFS GRIB search terms
- Subsets GRIB data to India bounds (5-39N, 67-99E)
- Converts to OCF-compatible Zarr format
- Supports monthly/yearly processing with merge
- Includes validation against OCF schema
- Update gfs.py: implement process_gfs_data() (was NotImplementedError)

OCF channel mapping: dlwrf->DLWRF, dswrf->DSWRF, hcc->HCDC, lcc->LCDC, mcc->MCDC, prate->PRATE, r->RH:850mb, t->TMP:2m, tcc->TCDC:entire, u10/v10->UGRD/VGRD:10m, u100/v100->UGRD/VGRD:100m, vis->VIS

Verified: 3 forecast steps processed successfully (12-14/14 channels).
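As a minimal sketch of the channel mapping and the per-step channel count mentioned in this commit message, the mapping can be written as a plain dictionary. The keys and GRIB terms are copied from the message above (with `u10/v10` and `u100/v100` expanded into separate entries); the helper name `missing_channels` is hypothetical, and the strings are the abbreviated forms from the message, not necessarily the exact search patterns used in `download_gfs_india.py`.

```python
# OCF channel -> GFS GRIB search term, as listed in the commit message.
# Assumption: "u10/v10->UGRD/VGRD:10m" expands to one entry per wind component.
OCF_TO_GRIB = {
    "dlwrf": "DLWRF",
    "dswrf": "DSWRF",
    "hcc": "HCDC",
    "lcc": "LCDC",
    "mcc": "MCDC",
    "prate": "PRATE",
    "r": "RH:850mb",
    "t": "TMP:2m",
    "tcc": "TCDC:entire",
    "u10": "UGRD:10m",
    "v10": "VGRD:10m",
    "u100": "UGRD:100m",
    "v100": "VGRD:100m",
    "vis": "VIS",
}


def missing_channels(found):
    """Return the OCF channels absent from `found`, in sorted order.

    Hypothetical helper: a forecast step reported as "12/14 channels"
    would yield the two channel names that were not extracted.
    """
    return sorted(set(OCF_TO_GRIB) - set(found))
```

For example, `missing_channels(OCF_TO_GRIB)` returns an empty list, while passing a partial list of channel names returns whichever of the 14 are absent.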
- NOMADS mode: subregion downloads (~33KB vs 300MB full GRIB)
- Selects India (5-39N, 67-99E) + 14 OCF channels server-side
- ~8500x reduction in download size per request
- Herbie mode: S3 byte-range for historical data (fallback)
- ThreadPoolExecutor with configurable workers (default 6)
- Both modes verified: NOMADS returns (137,129) India grid, Herbie extracts 14/14 channels at forecast steps

Designed to run on cloud infrastructure or Google Colab for production-scale processing. Local testing verified with both modes.
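The concurrent download pattern described above (a `ThreadPoolExecutor` with a configurable worker count, default 6) might look roughly like the sketch below. This is not the PR's code: `fetch_step` is a hypothetical placeholder for a single NOMADS or Herbie request, and `fetch_all` is an illustrative name. Only the default of 6 workers comes from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

DEFAULT_WORKERS = 6  # default worker count stated in the commit message


def fetch_step(step):
    """Placeholder for one per-forecast-step download (hypothetical).

    A real implementation would issue the network request here.
    """
    return step, f"gfs_india_f{step:03d}"


def fetch_all(steps, fetch=fetch_step, max_workers=DEFAULT_WORKERS):
    """Run `fetch` over all steps concurrently.

    Returns (results, failures), both keyed by forecast step, so a single
    failed request does not abort the remaining downloads.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, step): step for step in steps}
        for future in as_completed(futures):
            step = futures[future]
            try:
                _, payload = future.result()
                results[step] = payload
            except Exception as exc:  # record the failure and keep going
                failures[step] = exc
    return results, failures
```

Threads suit this workload because each task spends most of its time waiting on the network; collecting failures per step makes it possible to retry only the steps that failed.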
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #128      +/-   ##
==========================================
+ Coverage   45.72%   53.49%   +7.76%
==========================================
  Files          16       13       -3
  Lines        1124     1159      +35
==========================================
+ Hits          514      620     +106
+ Misses        610      539      -71
- tests/nwp/test_gfs.py: 8 unit tests covering all process_gfs_data branches (valid/invalid regions, custom output_dir, max_days, RuntimeError on None result) using mocked process_month
- test_india_pipeline.py: rewritten as proper pytest tests with mocked datasets (no hardcoded local paths, no S3 access needed)
- Fixes codecov/patch coverage failure (was 40.68%, target 45.72%)
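To illustrate the testing approach named in this commit (exercising branches of `process_gfs_data` against a mocked `process_month`), here is a self-contained sketch. The function below is a stand-in, not the repository's implementation: its signature, the injected `process_month` argument, the `VALID_REGIONS` set, and the `ValueError` for unknown regions are all assumptions. Only the `RuntimeError` on a `None` result is taken from the commit message.

```python
from unittest.mock import Mock

VALID_REGIONS = {"india"}  # assumption: the real set of regions is not shown in the PR


def process_gfs_data(region, process_month, max_days=None):
    """Stand-in for the real function (hypothetical signature)."""
    if region not in VALID_REGIONS:
        raise ValueError(f"unsupported region: {region}")  # assumed exception type
    result = process_month(region=region, max_days=max_days)
    if result is None:
        # Behaviour named in the commit message: RuntimeError on None result.
        raise RuntimeError("process_month returned no data")
    return result


def test_none_result_raises_runtime_error():
    fake_month = Mock(return_value=None)  # no network or S3 access needed
    try:
        process_gfs_data("india", fake_month)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    fake_month.assert_called_once_with(region="india", max_days=None)


def test_max_days_is_forwarded():
    fake_month = Mock(return_value="india.zarr")
    assert process_gfs_data("india", fake_month, max_days=2) == "india.zarr"
    fake_month.assert_called_once_with(region="india", max_days=2)
```

Because the mock replaces the download step entirely, tests of this shape run without local data paths or cloud credentials, which matches the goal stated in the commit.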
@peterdudfield Done. I've added the GFS download pipeline for India in the last 2 commits. How it works:

Verified: NOMADS returns the correct India grid (137×129), and Herbie extracts 14/14 channels at forecast steps.

Blocker: Full-scale processing (months of data) needs cloud compute. NOMADS covers only ~10 days, and Herbie takes hours per day locally. Could OCF run this on your infra, or should I set up a Colab notebook?

Also resolved the merge conflict and added proper unit tests; all CI checks are passing now.
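The reported 137×129 grid is consistent with the India bounds (5-39N, 67-99E) if the data sits on a 0.25° grid with both endpoints included. The 0.25° spacing is an assumption here, since the thread does not state the resolution, and `grid_shape` is an illustrative helper rather than anything from the PR.

```python
def grid_shape(lat_min, lat_max, lon_min, lon_max, step=0.25):
    """Number of (lat, lon) points on a regular grid, endpoints inclusive.

    Assumption: 0.25-degree spacing; the PR thread does not state the resolution.
    """
    n_lat = round((lat_max - lat_min) / step) + 1
    n_lon = round((lon_max - lon_min) / step) + 1
    return n_lat, n_lon


print(grid_shape(5, 39, 67, 99))  # (137, 129)
```

The arithmetic is (39 - 5) / 0.25 + 1 = 137 latitude points and (99 - 67) / 0.25 + 1 = 129 longitude points.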
Hi @Raakshass, I've incorporated your GFS download pipeline. I have the same doubts about whether training the model locally is the best idea, since downloading the GFS data took around 1 hour for 1 month. The .zarr for each month with all the variables is ~30MB, so storage wouldn't be a problem. However, as you've said, the data will span years at 0.5/1 hr resolution with 14 channels. I will wait for advice here before proceeding.
Pull Request
Description
Add India solar generation data pipeline for PVNet training, enabling solar forecasting for the Indian power grid.
Changes:
Data source: Mendeley DOI 10.17632/y58jknpgs8.2
Files added:
Fixes #121
How Has This Been Tested?
Ran test_india_pipeline.py which validates:
Trained baseline model achieving RMSE of 8,270 MW.
Plotted data distribution and verified solar generation patterns match expected diurnal cycles.
Checklist: