This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Commit 833323d

move README changes to separate PR
1 parent 579d449 commit 833323d

File tree: 1 file changed (README.md)

Lines changed: 36 additions & 118 deletions
@@ -1,26 +1,41 @@
 # nowcasting_dataset
-Pre-prepare batches of data for use in machine learning training.
-
-This code combines several data sources including:
+A multi-process data loader for PyTorch which aligns three separate datasets:
 
 * Satellite imagery (EUMETSAT SEVIRI RSS 5-minutely data of UK)
 * Numerical Weather Predictions (NWPs. UK Met Office UKV model from CEDA)
 * Solar PV power timeseries data (from PVOutput.org, downloaded using
-our [pvoutput Python code](https://github.com/openclimatefix/pvoutput).)
-* Topographic data.
-* The Sun's azimuth and angle.
+our [pvoutput Python
+code](https://github.com/openclimatefix/pvoutput).)
+
+When we first started writing `nowcasting_dataset`, our intention was
+to load and align data from these three datasets on-the-fly during ML
+training. And `nowcasting_dataset` can still be used that way! But
+it just isn't quite fast enough to keep a modern GPU constantly fed
+with data when loading multiple satellite channels and multiple NWP
+parameters. So, now, this code is used to pre-prepare thousands of
+batches, and save these batches to disk, each as a separate NetCDF
+file. These files can then be loaded super-quickly at training time.
+The end result is a 12x speedup in training.
 
-This repo doesn't contain the ML models themselves. Please see [this
-page for an overview](https://github.com/openclimatefix/nowcasting) of
-the Open Climate Fix solar PV nowcasting project, and how our code
-repositories fit together.
+The script `scripts/prepare_ml_data.py` is used to
+pre-compute the training and validation data (the script makes use of the
+`nowcasting_dataset` library).
+`nowcasting_dataset.dataset.datasets.NetCDFDataset` is a PyTorch Dataset which
+loads the pre-prepared batches during ML training.
 
+This repo doesn't contain the ML models themselves. The models are
+in: https://github.com/openclimatefix/predict_pv_yield/ and
+https://github.com/openclimatefix/satflow, and utils are in
+https://github.com/openclimatefix/nowcasting_utils
 
-# User manual
+Please see [this page for an
+overview](https://github.com/openclimatefix/nowcasting) of the Open
+Climate Fix solar PV nowcasting project, and how our code repositories
+fit together.
 
-## Installation
+# Installation
 
-### `conda`
+## Conda
 
 From within the cloned `nowcasting_dataset` directory:
 
@@ -35,129 +50,32 @@ and [pytorch_lightning](https://github.com/PyTorchLightning/pytorch-lightning) u
 ```shell
 pip install -e .[torch]
 ```
-but it is only used to create a dataloader for machine learning models, and will not be necessary
-soon (when the dataloader is moved to `nowcasting_dataloader`).
+but it is only used to create a dataloader for machine learning models.
 
-
-### `pip`
+## Pip
 
 A (probably older) version is also available through `pip install nowcasting-dataset`
 
-
-### `RuntimeError: unable to open shared memory object`
+## `RuntimeError: unable to open shared memory object`
 
 To prevent PyTorch failing with an error like `RuntimeError: unable to open shared memory object </torch_2276740_2849291446> in read-write mode`, edit `/etc/security/limits.conf` as root and add this line: `* soft nofile 512000` then log out and log back in again (see [this issue](https://github.com/openclimatefix/nowcasting_dataset/issues/158) for more details).
 
-
-### PV Live API
+## PV Live API
 If you want to also install [PVLive](https://github.com/SheffieldSolar/PV_Live-API) then use `pip install git+https://github.com/SheffieldSolar/PV_Live-API
 `
 
-### Pre-commit
+## Pre-commit
 
 A pre commit hook has been installed which makes `black` run with every commit. You need to install
 `black` and `pre-commit` (these will be installed by `conda` or `pip` when installing
 `nowcasting_dataset`) and run `pre-commit install` in this repo.
 
-
-## Testing
+# Testing
 
 To test using the small amount of data stored in this repo: `py.test -s`
 
 To test using the full dataset on Google Cloud, add the `--use_cloud_data` switch.
 
+# Documentation
 
-## Downloading data
-
-### Satellite data
-
-Use [Satip](https://github.com/openclimatefix/Satip) to download
-native EUMETSAT SEVIRI RSS data from EUMETSAT's API and then convert
-to an intermediate file format.
-
-
-### PV data from PVOutput.org
-
-Download PV timeseries data from PVOutput.org using
-our PVOutput code](https://github.com/openclimatefix/pvoutput).
-
-
-### Numerical weather predictions from the UK Met Office
-
-Request access to the [UK Met Office data on CEDA](https://catalogue.ceda.ac.uk/uuid/f47bc62786394626b665e23b658d385f).
-
-Once you have a username and password, download using:
-
-```shell
-wget --user=<username> --password=<password> --recursive -nH --cut-dirs=5 --no-clobber \
---reject-regex "[[:digit:]]{8}(03|09|15|21)00.*\.grib$" \
---reject-regex "T120\.grib$" \
---reject-regex "Wholesale5.*\.grib$" \
-ftp://ftp.ceda.ac.uk/badc/ukmo-nwp/data/ukv-grib
-```
-
-(You probably want to run this in a `gnu screen` session if you're SSH'ing into a VM or remote server).
-
-What are all those `--reject-regex` instructions doing?
-
-* `--reject-regex "[[:digit:]]{8}(03|09|15|21)00.*\.grib$"` rejects all NWPs initialised at
-3, 9, 15, or 21 hours (and so you end up with "only" four initialisations per day: 00, 06, 12, 18).
-* `--reject-regex "T120\.grib$"` rejects the `T120` files, which contain forecast steps from
-2 days and 9 hours ahead, to 5 days ahead, in 3-hourly increments. So we accept the
-`Wholesale[1234].grib` files (steps from 00:00 to 1 day and 12 hours ahead, in hourly increments)
-and `Wholesale[1234]T54.grib` files (step runs from 1 day and 13 hours ahead to 2 days and 6 hours
-ahead. Hourly increments from 1 day and 13 hours ahead to 2 days ahead.
-Then 3-hourly increments).
-* `--reject-regex "Wholesale5.*\.grib$"` rejects the `Wholesale5` files, which are just static
-topography data, so no need to download multiple copies of this data!
-
-Detailed docs of the Met Office data is available [here](http://cedadocs.ceda.ac.uk/1334/1/uk_model_data_sheet_lores1.pdf).
-
-
-### GSP-level estimates of PV outturn from PV Live Regional
-
-TODO
-
-
-### Topographical data
-
-TODO
-
-
-## Configure `nowcasting_dataset` to point to the downloaded data
-
-Copy and modify one of the config yaml files in
-[`nowcasting_dataset/config/`](https://github.com/openclimatefix/nowcasting_dataset/tree/main/nowcasting_dataset/config)
-
-
-## Prepare ML batches
-
-Run [`scripts/prepare_ml_data.py`](https://github.com/openclimatefix/nowcasting_dataset/blob/main/scripts/prepare_ml_data.py)
-
-
-## Load prepared ML batches into an ML model
-
-`nowcasting_dataset.dataset.datasets.NetCDFDataset` is a PyTorch
-Dataset which loads the pre-prepared batches during ML training
-(although this will soon be moved to a separate
-[`nowcasting_dataloader`
-repository](https://github.com/openclimatefix/nowcasting_dataloader)).
-
-
-## What exactly is in each batch?
-
-Please see the `data_sources/<modality>/<modality>_model.py` files
-(where `<modality>` is one of {datetime, metadata, gsp, nwp, pv,
-satellite, sun, topographic}) for documentation about the different
-data fields in each example / batch.
-
-
-# History of nowcasting_dataset
-When we first started writing `nowcasting_dataset`, our intention was
-to load and align data from these three datasets on-the-fly during ML
-training. But it just isn't quite fast enough to keep a modern GPU constantly fed
-with data when loading multiple satellite channels and multiple NWP
-parameters. So, now, this code is used to pre-prepare thousands of
-batches, and save these batches to disk, each as a separate NetCDF
-file. These files can then be loaded super-quickly at training time.
-The end result is a 12x speedup in training.
+Please see the [`Example` class](https://github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/example.py) for documentation about the different data fields in each example / batch.
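The "pre-prepare batches, then load them quickly at training time" workflow the README diff describes can be sketched in plain Python. This is an illustrative stand-in, not the library's API: the real project writes NetCDF files via `scripts/prepare_ml_data.py` and reads them with `nowcasting_dataset.dataset.datasets.NetCDFDataset`, whereas here JSON files and a hypothetical `BatchDataset` class are used so the pattern is self-contained.

```python
# Sketch of the pre-prepared-batches pattern (hypothetical names; the
# real library uses NetCDF files and a PyTorch NetCDFDataset instead).
import json
import tempfile
from pathlib import Path


def prepare_batches(out_dir: Path, n_batches: int) -> None:
    """Pre-compute batches once, saving each as a separate file on disk."""
    for i in range(n_batches):
        # A real batch would hold aligned satellite, NWP and PV arrays;
        # small dummy lists stand in here.
        batch = {"satellite": [i] * 4, "nwp": [i * 2] * 4, "pv": [i * 3] * 4}
        (out_dir / f"batch_{i}.json").write_text(json.dumps(batch))


class BatchDataset:
    """Minimal Dataset-style wrapper: __len__/__getitem__ over saved batches."""

    def __init__(self, batch_dir: Path):
        self.paths = sorted(batch_dir.glob("batch_*.json"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> dict:
        # Loading a ready-made batch is a single file read: far cheaper than
        # aligning several data sources on the fly during training.
        return json.loads(self.paths[idx].read_text())


tmp = Path(tempfile.mkdtemp())
prepare_batches(tmp, n_batches=3)
ds = BatchDataset(tmp)
print(len(ds), ds[1]["nwp"])
```

Because `BatchDataset` exposes `__len__` and `__getitem__`, the same shape of class can be dropped into a `torch.utils.data.DataLoader` when PyTorch is available.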
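The three `--reject-regex` filters in the (removed) CEDA download instructions can be sanity-checked with Python's `re` module. Note that Python has no POSIX `[[:digit:]]` class, so `\d` stands in for it, and the sample file names below are invented for illustration rather than real CEDA paths.

```python
# Check which (hypothetical) file names the wget --reject-regex filters
# from the old README would keep or drop.
import re

REJECT = [
    re.compile(r"\d{8}(03|09|15|21)00.*\.grib$"),  # drop 03/09/15/21Z runs
    re.compile(r"T120\.grib$"),                    # drop long-range T120 steps
    re.compile(r"Wholesale5.*\.grib$"),            # drop static topography files
]


def keep(name: str) -> bool:
    """True if the file name survives all three reject filters."""
    return not any(pattern.search(name) for pattern in REJECT)


print(keep("202101010600_Wholesale1.grib"))      # 06Z run: kept
print(keep("202101010300_Wholesale1.grib"))      # 03Z run: rejected
print(keep("202101010000_Wholesale1T120.grib"))  # T120 file: rejected
print(keep("202101010000_Wholesale5.grib"))      # topography: rejected
```

So only the 00/06/12/18Z `Wholesale[1234]` (and `T54`) files survive, matching the bullet-point explanation in the diff.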

0 commit comments