This repository has been archived by the owner on Jan 10, 2025. It is now read-only.

Commit

work on doc
b8raoult committed Mar 24, 2024
1 parent 7f0be07 commit e91e2a2
Showing 7 changed files with 92 additions and 23 deletions.
2 changes: 1 addition & 1 deletion docs/building/filters.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _dataset-filters:
.. _filters:

#########
Filters
10 changes: 6 additions & 4 deletions docs/building/introduction.rst
@@ -1,4 +1,4 @@
.. _datasets-building:
.. _building-introduction:

##############
Introduction
@@ -41,14 +41,13 @@ source
The `source` is a software component that, given a list of dates and
variables, will return the corresponding fields. An example of a source
is ECMWF's MARS archive, a collection of GRIB or NetCDF files, a
database, etc. See :ref:`dataset-sources` for more information.
database, etc. See :ref:`sources` for more information.

filter
A `filter` is a software component that takes as input the output of
a source or of another filter and can modify the fields and/or
their metadata. For example, typical filters are interpolations,
renaming of variables, etc. See :ref:`dataset-filters` for more
information.
renaming of variables, etc. See :ref:`filters` for more information.

************
Operations
@@ -72,6 +71,9 @@ concat
build a dataset that spans several years, when several sources are
involved, each providing a different period.

Each operation is itself considered a :ref:`source <sources>`, so
operations can be combined to build complex datasets.
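As an illustration of combining operations, a hypothetical recipe might
nest a ``join`` inside a ``concat``. The exact schema, source names and
keys below are assumptions for illustration, not taken from this commit:

```yaml
# Hypothetical sketch: a concat of two periods, where the first
# period is itself a join of two sources. All names are illustrative.
input:
  concat:
    - dates:
        start: 2000-01-01
        end: 2009-12-31
      join:
        - source-a:
            param: [2t, msl]
        - source-b:
            param: [q, t]
    - dates:
        start: 2010-01-01
        end: 2020-12-31
      source-c:
        param: [2t, msl, q, t]
```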

*****************
Getting started
*****************
31 changes: 15 additions & 16 deletions docs/building/operations.rst
@@ -1,42 +1,41 @@
.. _dataset-operations:
.. _operations:

############
Operations
############

Operations are blocks of YAML code that translate a list of dates into
fields.

******
join
******

The join is the process of combining data from several sources. Each
source is expected to provide different variables at the same dates.

.. literalinclude:: input.yaml
   :language: yaml
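The contents of ``input.yaml`` are not shown in this commit; a minimal
sketch of what a join recipe could look like, with hypothetical source
names and keys:

```yaml
# Hypothetical join: each source provides different variables
# for the same dates. Source names and keys are illustrative.
join:
  - mars:
      levtype: sfc
      param: [2t, msl]
  - mars:
      levtype: pl
      param: [q, t]
      level: [850, 500]
```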

********
concat
********

The concatenation is the process of combining different sets of
operations that handle different dates. This is typically used to build
a dataset that spans several years, when several sources are involved,
each providing a different period.

.. literalinclude:: concat.yaml
   :language: yaml
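The contents of ``concat.yaml`` are not shown here; a hypothetical
sketch of a concat recipe, where each entry covers a different period
(all names are illustrative assumptions):

```yaml
# Hypothetical concat: each entry handles a disjoint period.
concat:
  - dates:
      start: 2000-01-01
      end: 2009-12-31
    source-a:
      param: [2t]
  - dates:
      start: 2010-01-01
      end: 2020-12-31
    source-b:
      param: [2t]
```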

******
pipe
******

The pipe is the process of transforming fields using filters. The first
step of a pipe is typically a source, a join or another pipe. The
following steps are filters.

.. literalinclude:: pipe.yaml
   :language: yaml
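The contents of ``pipe.yaml`` are not shown here; a hypothetical sketch
of a pipe recipe, with a source followed by filters. The source and
filter names (``mars``, ``rename``, ``interpolate``) and their keys are
illustrative assumptions, not taken from this commit:

```yaml
# Hypothetical pipe: the first step is a source, the following
# steps are filters applied to its output.
pipe:
  - mars:
      param: [2t]
  - rename:
      2t: t2m
  - interpolate:
      grid: [1.0, 1.0]
```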
2 changes: 1 addition & 1 deletion docs/building/sources.rst
@@ -1,4 +1,4 @@
.. _dataset-sources:
.. _sources:

#########
Sources
10 changes: 9 additions & 1 deletion docs/index.rst
@@ -11,12 +11,20 @@
*Anemoi* is a framework for developing machine learning weather
forecasting models. It comprises components or packages for preparing
training datasets, conducting ML model training, and a registry for
datasets and trained models. Anemoi provides tools for operational
datasets and trained models. *Anemoi* provides tools for operational
inference, including interfacing to verification software. As a
framework it seeks to handle many of the complexities that
meteorological organisations will share, allowing them to easily train
models from existing recipes but with their own data.

An *Anemoi dataset* is a thin wrapper around a zarr_ store that is
optimised for training data-driven weather forecasting models. It is
organised in such a way that I/O operations are minimised.

This documentation is divided into two main sections: :ref:`how to use
existing datasets <using-introduction>` and :ref:`how to build new
datasets <building-introduction>`.

- :doc:`overview`
- :doc:`installing`
- :doc:`firststeps`
2 changes: 2 additions & 0 deletions docs/overview.rst
@@ -1,3 +1,5 @@
.. _overview:

##########
Overview
##########
58 changes: 58 additions & 0 deletions docs/using/introduction.rst
@@ -1,3 +1,61 @@
.. _using-introduction:

##############
Introduction
##############

.. warning::

   The code below still mentions the old name of the package,
   `ecml_tools`. This will be updated once the package is renamed to
   `anemoi-datasets`.

An *Anemoi* dataset is a thin wrapper around a zarr_ store that is
optimised for training data-driven weather forecasting models. It is
organised in such a way that I/O operations are minimised (see
:ref:`overview`).

.. _zarr: https://zarr.readthedocs.io/

To open a dataset, you can use the `open_dataset` function.

.. code:: python

   from anemoi_datasets import open_dataset

   ds = open_dataset("path/to/dataset.zarr")

You can then access the data in the dataset using the `ds` object as if
it was a NumPy array.

.. code:: python

   print(ds.shape)
   print(len(ds))
   print(ds[0])
   print(ds[10:20])

One of the main features of the *anemoi-datasets* package is the
ability to subset or combine datasets.

.. code:: python

   from anemoi_datasets import open_dataset

   ds = open_dataset("path/to/dataset.zarr", start=2000, end=2020)

In that case, a dataset is created that only contains the data between
the years 2000 and 2020. Combining is done by passing multiple paths to
the `open_dataset` function:

.. code:: python

   from anemoi_datasets import open_dataset

   ds = open_dataset("path/to/dataset1.zarr", "path/to/dataset2.zarr")

In the latter case, the datasets are combined along the time dimension
or the variable dimension, depending on the datasets' structure.
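To make the combination behaviour concrete, here is a self-contained
sketch in plain Python. It is not the anemoi-datasets implementation;
it only illustrates how a thin wrapper can concatenate two stores along
the time dimension while exposing `len()` and indexing like an array:

```python
# Illustrative only: NOT the anemoi-datasets API. A minimal stand-in
# showing how a thin wrapper can combine two date-indexed stores.

class ConcatDataset:
    """Concatenate two datasets that cover consecutive periods."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def __len__(self):
        return len(self.first) + len(self.second)

    def __getitem__(self, i):
        # Route the index to whichever underlying store holds it,
        # so only that store needs to perform any I/O.
        if i < len(self.first):
            return self.first[i]
        return self.second[i - len(self.first)]


# Two toy "stores" covering different periods.
period_a = [("2000-01-01", 1.0), ("2000-01-02", 2.0)]
period_b = [("2010-01-01", 3.0)]

ds = ConcatDataset(period_a, period_b)
print(len(ds))  # 3
print(ds[2])    # ('2010-01-01', 3.0)
```

The real package makes the same kind of routing decision when several
paths are passed to `open_dataset`, choosing the combination dimension
from the structure of the datasets.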
