-
Notifications
You must be signed in to change notification settings - Fork 29
pint-xarray blog post #251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 14 commits
6d26cc3
d9596c8
772c427
7ca85ce
d5d3017
f35ff96
829ae25
c47a615
e035caa
d953fb9
684b465
5baef32
64cd10b
baf2948
ba9bcf2
1e32d37
844096b
2aedbac
bfeb716
3d02b83
a1cde77
8d37219
3f358b9
0758dca
8df4609
7c52770
3f0e319
c048904
7b4cfeb
bf63bc8
7179b6c
37cd971
99c11cc
3aa009c
f9bb62c
1b2b832
061aa8b
0283833
7144eb0
94dbe36
69f6035
587e16c
d8d6578
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,292 @@ | ||||||
--- | ||||||
title: "Unit-aware arithmetic in Xarray, via pint" | ||||||
date: "2022-07-06" | ||||||
authors: | ||||||
- Justus Margin | ||||||
- Tom Nicholas | ||||||
summary: "Xarray now supports unit-aware operations by wrapping pint arrays" | ||||||
--- | ||||||
|
||||||
_TLDR: Xarray now supports unit-aware operations by wrapping [pint arrays](https://pint.readthedocs.io/en/stable/), so your code can automatically track the physical units that your data represents:_ | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
```ipython | ||||||
In [2]: distance = xr.DataArray(10).pint.quantify("metres") | ||||||
...: time = xr.DataArray(4).pint.quantify("seconds") | ||||||
...: | ||||||
...: distance / time | ||||||
Out[2]: | ||||||
<xarray.DataArray ()> | ||||||
<Quantity(2.5, 'meter / second')> | ||||||
``` | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Units are integral to science | ||||||
|
||||||
All quantities in science have units, whether explicitly or implicitly. (And even dimensionless quantities like ratios still technically have units.) | ||||||
|
||||||
Getting our units right is finicky, and can very easily go unnoticed in our code. | ||||||
|
||||||
Even worse, the consequences of getting units wrong can be huge! | ||||||
andersy005 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
The most famous example of a units error has to be NASA's $125 million [Mars Climate Orbiter](https://www.simscale.com/blog/2017/12/nasa-mars-climate-orbiter-metric/), which in 1999 burned up in the Martian atmosphere instead of successfully entering orbit around Mars. | ||||||
A trajectory course correction had gone wrong, and the error was eventually traced back to a units mismatch: the engineers at Lockheed Martin expressed impulse in [pound-force](<https://en.wikipedia.org/wiki/Pound_(force)>) seconds, whereas the engineers at JPL assumed the impulse value their part of the software received was in SI newton seconds. | ||||||
|
||||||
<p align = "center"> | ||||||
<img src = "https://clqtg10snjb14i85u49wifbv-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/Customers.jpg" /> | ||||||
</p> | ||||||
|
||||||
<p align = "center"> | ||||||
Newspaper cartoon depicting the incongruence in the units used by NASA and Lockheed Martin scientists that led to the Mars Climate Orbiter disaster. | ||||||
</p> | ||||||
|
||||||
We should take stories like this seriously: If we can automatically track units we can potentially eliminate a whole class of possible errors in our scientific work... | ||||||
andersy005 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Pint tracks units | ||||||
|
||||||
There are a few packages for handling units in python (notably [unyt](https://github.com/yt-project/unyt) and [astropy.units](https://docs.astropy.org/en/stable/units/)), but for technical reasons we began units integration in xarray with [pint](https://pint.readthedocs.io/en/stable/). | ||||||
These various packages work by providing a numerical array type that acts similarly to a numpy array, and is intended to plug in and replace the raw numpy array (a so-called "duck array type"). | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Pint provides the `Quantity` object, which is a normal numpy array combined with a `pint.Unit`: | ||||||
|
||||||
```python | ||||||
q = np.array([6, 7]) * pint.Unit('metres') | ||||||
print(repr(q)) | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: <Quantity([6 7], 'meter')> | ||||||
``` | ||||||
|
||||||
Pint Quantities act like numpy arrays, except that the units are carried around with the arrays, propagated through operations, and checked during operations involving multiple quantities. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Xarray now wraps Pint | ||||||
|
||||||
Thanks to the [tireless work](https://github.com/pydata/xarray/issues/3594) of xarray core developer Justus Magin, you can now enjoy this automatic unit-handling in xarray! | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Once you create a unit-aware xarray object (see below for how) you can see the units of the data variables displayed as part of the printable representation. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
You also immediately get the key benefits of pint: | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
1. Units are propagated through arithmetic, and new quantities are built using the units of the inputs: | ||||||
|
||||||
```python | ||||||
distance = xr.DataArray(10).pint.quantify("metres") | ||||||
time = xr.DataArray(4).pint.quantify("seconds") | ||||||
|
||||||
distance / time | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
<xarray.DataArray ()> | ||||||
<Quantity(2.5, 'meter / second')> | ||||||
``` | ||||||
|
||||||
2. Dimensionally inconsistent units are caught automatically: | ||||||
|
||||||
```python | ||||||
apples = xr.DataArray(10).pint.quantify("kg") | ||||||
oranges = xr.DataArray(200).pint.quantify("cm^3") | ||||||
|
||||||
apples + oranges | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
DimensionalityError: Cannot convert from 'kilogram' ([mass]) to 'centimeter ** 3' ([length] ** 3) | ||||||
``` | ||||||
|
||||||
3. Unit conversions become simple: | ||||||
|
||||||
```python | ||||||
walk = xr.DataArray(500).pint.quantify('miles') | ||||||
|
||||||
walk.pint.to('parsecs') | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
<xarray.DataArray ()> | ||||||
<Quantity(2.6077643524162074e-11, 'parsec')> | ||||||
``` | ||||||
|
||||||
4. You can specify that custom functions should expect certain units, and convert them if needed: | ||||||
|
||||||
```python | ||||||
def jpl_trajectory_code(impulse): | ||||||
"""This function will only compute the correct result if supplied input in units of Newton-seconds.""" | ||||||
|
||||||
if impulse.pint.units != "newton * second": | ||||||
impulse = impulse.pint.to("Newton * seconds") | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
# do some rocket science | ||||||
... | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This feels a bit strange to me: def jpl_trajectory_code(..., impulse):
# do trajectory propagation
...
return propagated_position.pint.to("km"), propagated_velocity.pint.to("km/s") For functions that don't support units, we'd either write a wrapper that does the conversion (no need to check if the units already are what we need) and strips the units, then adds units to the result, or use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Feel free to change the details of this example. The point just needs to be that unit-aware code would have prevented the error that befell the JPL mission. Exactly whether the code raises an error, converts, or whatever doesn't really matter for the pedagogical purposes of this blog post. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just noticed that with that it becomes essentially the same as 3 (unit conversion). Do you want to replace / extend that section? 4 should then be a section on wrapping library functions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay fair point that then they would be the same. Instead consider: The point of section 3 is showing that you can do unit conversions, the point of section 4 is showing that you can require specific units for finicky library functions. The easiest way to demonstrate that you can require specific units on an input is to raise an error. We don't need to over-think it beyond that. Therefore I now think that this would be fine for the purposes of this blog post def jpl_trajectory_code(impulse):
"""A function which will only compute the correct result if supplied input in units of Newton-seconds."""
if impulse.pint.units != "newton * second":
raise pint.DimensionalityError()
# do some rocket science
... It gets the point across. Yes in practice you would probably do this a bit differently, but that's what our pint-xarray docs should demonstrate. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It feels a bit strange to have a blog post advertising What if, instead of presenting it as a new feature, we'd show it as an example using 1 to 3? For unit-aware functions, that's what we'd actually end up doing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Okay - what would that look like? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the example would be the one from #251 (comment), but we'd describe it as something like:
I'm sure we can improve the wording, though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||
|
||||||
lockheed_impulse_value = xr.DataArray(5).pint.quantify("force_pounds * seconds") | ||||||
|
||||||
jpl_trajectory_code(lockheed_impulse_value) | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
pint.DimensionalityError | ||||||
``` | ||||||
|
||||||
(Note: We are adding [new features](https://github.com/xarray-contrib/pint-xarray/pull/143) to make unit specification of function arguments more slick.) | ||||||
TomNicholas marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
In the abstract, tracking units like this is useful in the same way that labelling dimensions with xarray is useful: it helps us avoid errors by relieving us of the burden of remembering arbitrary information about our data. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Quantifying with pint-xarray | ||||||
|
||||||
The easiest way to create a unit-aware xarray object is to use the helper package we made: [pint-xarray](https://github.com/xarray-contrib/pint-xarray). | ||||||
Once you `import pint_xarray` you can access unit-related functionality via `.pint` on any xarray DataArray or Dataset (this works via [xarray's accessor interface](https://xarray.pydata.org/en/stable/internals/extending-xarray.html)). | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Above we have seen examples of quantifying explicitly, where we specify the units in the call to `.quantify()`. | ||||||
We can do this for multiple variables too, and we can also pass `pint.Unit` instances: | ||||||
|
||||||
```python | ||||||
ds = xr.Dataset({'a': 2, 'b': 10}) | ||||||
|
||||||
ds.pint.quantify({'a': 'kg', | ||||||
'b': pint.Unit('moles')}) | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
<xarray.Dataset> | ||||||
Dimensions: () | ||||||
Data variables: | ||||||
a int64 [kg] 2 | ||||||
b int64 [mol] 10 | ||||||
``` | ||||||
|
||||||
Alternatively, we can quantify from the object's `.attrs`, automatically reading the metadata which xarray objects carry around. | ||||||
If nothing is passed to `.quantify()`, it will attempt to parse the `.attrs['units']` entry for each data variable. | ||||||
|
||||||
This means that for scientific datasets which are stored as files with units in their attributes (which netCDF and Zarr can do for example), using pint with xarray becomes as simple as | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
```python | ||||||
import pint_xarray | ||||||
|
||||||
ds = open_dataset(filepath).pint.quantify() | ||||||
``` | ||||||
|
||||||
## Dequantifying | ||||||
|
||||||
To convert our pint arrays back into numpy arrays, we can use `.dequantify`. | ||||||
This will strip the units from the arrays and replace them into the `.attrs['units']` of each variable. | ||||||
This is useful when we want to save our data back to a file, as it means that the current units will be preserved in the attributes of a netcdf file (or zarr store etc.), as long as we just do `ds.pint.dequantify().to_netcdf()`. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Dask integration | ||||||
|
||||||
So xarray can wrap dask arrays, and now it can wrap pint quantities… Can we use both together? Yes! | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
You can get a unit-aware, dask-backed array either by `.pint.quantify()`-ing a chunked array, or you can `.pint.chunk()` a quantified array. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
(If you have dask installed, then `open_dataset(f).pint.quantify()` will already give you a dask-backed, quantified array.) | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
From there you can `.compute()` the dask-backed objects as normal, and the units will be retained. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
(Under the hood we now have an `xarray.DataArray` wrapping a `pint.Quantity`, which wraps a `dask.array.Array`, which wraps a `numpy.ndarray`. | ||||||
This "multi-nested duck array" approach can be generalised to include other array libraries (e.g. `scipy.sparse`), but requires [co-ordination](https://github.com/pydata/duck-array-discussion) between the maintainers of the libraries involved.) | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Unit-aware indexes | ||||||
|
||||||
We would love to be able to promote xarray indexes to pint Quantities, as that would allow you to select data subsets in a unit-aware manner like | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
```python | ||||||
da = xr.DataArray(name='a', data=[0, 1, 2], dims='x', coords={'x': [1000, 2000, 3000]}) | ||||||
da = da.pint.quantify({'a': 'Pa', 'x': 'm'}) | ||||||
|
||||||
da.pint.sel(x=2 * 'km') | ||||||
``` | ||||||
|
||||||
Unfortunately this will not possible until the ongoing work to extend xarray to support [explicit indexes](https://github.com/pydata/xarray/issues/1603) is complete. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
In the meantime pint-xarray offers a workaround. If you tell `.quantify` the units you wish an index to have, it will store those in `.attrs["units"]` instead. | ||||||
|
||||||
```python | ||||||
time = xr.DataArray([0.1, 0.2, 0.3], dims='time') | ||||||
distance = xr.DataArray(name='distance', | ||||||
data=[10, 20, 25], | ||||||
dims=['time'], | ||||||
coords={'time': time}) | ||||||
distance = distance.pint.quantify({'distance': 'metres', | ||||||
'time': 'seconds'}) | ||||||
print(distance.coords['time'].attrs) | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: {'units': <Unit('second')>} | ||||||
``` | ||||||
|
||||||
This allows us to provide conveniently wrapped versions of common xarray methods like `.sel`, so that you can still select subsets of data in a unit-aware fashion like this: | ||||||
|
||||||
```python | ||||||
distance.pint.sel(time=200 * pint.Unit('milliseconds')) | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: | ||||||
<xarray.DataArray 'distance' ()> | ||||||
<Quantity(20, 'meter')> | ||||||
Coordinates: | ||||||
time float64 200.0 | ||||||
``` | ||||||
|
||||||
Observe how the `.pint.sel` operation has first converted 200 milliseconds to 0.2 seconds, before finding the distance value that occurs at a time position of 0.2 seconds. | ||||||
|
||||||
[This wrapping is currently necessary](https://xarray.pydata.org/en/stable/user-guide/duckarrays.html#missing-features) for any operation which needs to be aware of the units of a dimension coordinate of the dataarray, or any xarray operation which relies on an external library (such as calling `scipy` in `.integrate`). | ||||||
|
||||||
## CF-compliant units for geosciences with cf-xarray | ||||||
|
||||||
Different fields tend to have different niche conventions about how certain units are defined. | ||||||
By default, pint doesn't understand all the unusual units and conventions we use in geosciences. | ||||||
But [pint is customisable](https://pint.readthedocs.io/en/stable/defining.html), and with the help of [cf-xarray](https://github.com/xarray-contrib/cf-xarray) we can teach it about these geoscience-specific units. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
If we `import cf_xarray.units` (before `import pint_xarray`) then we can `quantify` example data from a pangeo climate data store : | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
```python | ||||||
import xarray as xr | ||||||
import cf_xarray.units # must come before import pint_xarray | ||||||
import pint_xarray | ||||||
|
||||||
ds = xr.open_dataset('gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r2i1p1f1/Amon/sfcWind/gn/v20200226/', engine='zarr') | ||||||
ds = ds.pint.quantify() | ||||||
|
||||||
squared_wind = ds['sfcWind'] ** 2 | ||||||
squared_wind.pint.units | ||||||
``` | ||||||
|
||||||
``` | ||||||
Out: <Unit('meter ** 2 / second ** 2')> | ||||||
``` | ||||||
|
||||||
Here (thanks to `cf_xarray`) pint has successfully interpreted the CF-style units `'m s-1'`, then automatically changed them when we squared the wind speed. | ||||||
|
||||||
## Plotting | ||||||
|
||||||
We can complete our real-world example by plotting the data in its new units | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
```python | ||||||
import cartopy.crs as ccrs | ||||||
import matplotlib.pyplot as plt | ||||||
|
||||||
p = squared_wind.isel(time="2014-01").plot( | ||||||
subplot_kws=dict(projection=ccrs.Orthographic(-80, 35), facecolor="gray"), | ||||||
transform=ccrs.PlateCarree(), | ||||||
) | ||||||
p.axes.set_global() | ||||||
p.axes.coastlines() | ||||||
plt.show() | ||||||
``` | ||||||
|
||||||
 | ||||||
|
||||||
where `xarray.plot` has detected the pint units automatically. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Conclusion | ||||||
|
||||||
Please have a go! You will need xarray (v0.20+) and pint-xarray. | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Please also tell us about any bugs you find, or documentation suggestions you have on the [xarray](https://github.com/pydata/xarray/issues) or [pint-xarray issue trackers](https://github.com/xarray-contrib/pint-xarray/issues). | ||||||
If you have usage questions you can raise them there, on the [xarray discussions page](https://github.com/pydata/xarray/discussions), or on the [pangeo discourse forum](https://discourse.pangeo.io/). | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
The work here to allow xarray to wrap pint objects is part of a [broader effort to generalise xarray](http://xarray.pydata.org/en/stable/roadmap.html#flexible-arrays) to handle a wide variety of data types (so-called "duck array wrapping"). | ||||||
Along with the incoming [support for flexible indexes](http://xarray.pydata.org/en/stable/roadmap.html#flexible-indexes), we are excited for all the new features that this will enable for xarray users! | ||||||
keewis marked this conversation as resolved.
Show resolved
Hide resolved
|
Uh oh!
There was an error while loading. Please reload this page.