Description
Overview
When `to_dataframe()` is called on an xarray `Dataset` with a multi-index along a given dimension, the index coordinates are translated both:
- into levels of a pandas `MultiIndex` for the dataframe
- into individual columns of the dataframe.
Is this expected and intended behavior?
Main reprex
import numpy as np
import pandas as pd
import xarray as xr
data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_dict_w_dims = {k: ("my_dim", v) for k, v in data_dict.items()}
# create a dataset multi-indexed along "my_dim" by "x" and "y"
xr_dat = xr.Dataset(data_dict_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])
print(xr_dat)
# <xarray.Dataset> Size: 140B
# Dimensions: (my_dim: 5)
# Coordinates:
# * my_dim (my_dim) object 40B MultiIndex
# * x (my_dim) int64 40B 1 2 1 2 1
# * y (my_dim) <U1 20B 'a' 'a' 'b' 'b' 'b'
# Data variables:
# z (my_dim) int64 40B 5 10 15 20 25
print(xr_dat.to_dataframe()) # x and y present both as columns and as multi-index
# z x y
# x y
# 1 a 5 1 a
# 2 a 10 2 a
# 1 b 15 1 b
# 2 b 20 2 b
# 1 b 25 1 b
Cause
I believe the key line is in the internal `_to_dataframe()` method (lines 7092 to 7095 in 699d895):
def _to_dataframe(self, ordered_dims: Mapping[Any, int]):
    from xarray.core.extension_array import PandasExtensionArray
    columns_in_order = [k for k in self.variables if k not in self.dims]
The constituent `IndexVariable`s of the multi-index are present in `self.variables` (and not in `self.dims`), so they become columns:
"x" in xr_dat.dims
# False
"x" in xr_dat.variables
# True
xr_dat.variables["x"]
# <xarray.IndexVariable 'my_dim' (my_dim: 5)> Size: 40B
# [5 values with dtype=int64]
This has consequences for pandas -> xarray -> pandas conversion
Because of this, converting a `MultiIndex`-ed pandas dataframe to an xarray `Dataset` via the `xr.Dataset()` constructor and then converting back to pandas via `.to_dataframe()` will not give back the original dataframe.
Reprex
# create a multi-indexed pandas dataframe
pd_df = pd.DataFrame(
data_dict
).set_index(["x", "y"])
print(pd_df) # multi-indexed-df with one column
# z
# x y
# 1 a 5
# 2 a 10
# 1 b 15
# 2 b 20
# 1 b 25
# Conversion to xarray is as expected:
xr_from_pd = xr.Dataset(pd_df)
print(xr_from_pd)
# <xarray.Dataset> Size: 160B
# Dimensions: (dim_0: 5)
# Coordinates:
# * dim_0 (dim_0) object 40B MultiIndex
# * x (dim_0) int64 40B 1 2 1 2 1
# * y (dim_0) object 40B 'a' 'a' 'b' 'b' 'b'
# Data variables:
# z (dim_0) int64 40B 5 10 15 20 25
# Converting back to pandas df via `to_dataframe()` yields a df multi-indexed by
# x and y that also contains `x` and `y` as columns:
print(xr_from_pd.to_dataframe()) # x and y as multi-index and as columns
# x y z
# x y
# 1 a 1 a 5
# 2 a 2 a 10
# 1 b 1 b 15
# 2 b 2 b 20
# 1 b 1 b 25
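Until the behavior is settled, one possible workaround on the pandas side is to drop any column whose name matches an index level name. The sketch below is pandas-only: the `round_tripped` frame is built by hand to mimic the output reported above rather than produced by xarray, and `drop_columns_duplicating_index` is a hypothetical helper name, not an xarray or pandas API. It assumes that a column sharing a name with an index level really is a duplicate of that level.

```python
import pandas as pd


def drop_columns_duplicating_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns named like one of its index levels."""
    level_names = {name for name in df.index.names if name is not None}
    duplicated = [col for col in df.columns if col in level_names]
    return df.drop(columns=duplicated)


data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])

# The original multi-indexed frame with a single column "z".
original = pd.DataFrame(data).set_index(["x", "y"])

# Hand-built stand-in for the reported to_dataframe() result:
# "x" and "y" appear both as index levels and as columns.
round_tripped = pd.DataFrame(data).set_index(["x", "y"], drop=False)

cleaned = drop_columns_duplicating_index(round_tripped)
print(cleaned)
```

This only post-processes the result; it does not change what `to_dataframe()` itself returns.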
Thoughts
- If this behavior is not intended, the flagged line in `_to_dataframe()` should be changed to determine column names in a way that ignores `IndexVariable`s that form part of a multi-index.
- It might be important not to simply filter to data variables, because one might want coordinates to become columns when they are not going to be part of the pandas `MultiIndex`, e.g.
# similar dataset with x and y as coordinates but not as a multi-index
dat_no_multiindex = xr.Dataset(
data_dict_w_dims
).set_coords(["x", "y"])
# potentially intended behavior?
print(dat_no_multiindex.to_dataframe())
# x y z
# my_dim
# 0 1 a 5
# 1 2 a 10
# 2 1 b 15
# 3 2 b 20
# 4 1 b 25
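To make the suggestion above concrete, here is a rough sketch of a column selection that skips multi-index levels but keeps other coordinates. It is not xarray's implementation and not a tested patch: `columns_excluding_multiindex_levels` is a hypothetical helper, and it assumes that `Dataset.indexes` exposes a `pd.MultiIndex` for multi-indexed coordinates.

```python
import pandas as pd
import xarray as xr


def columns_excluding_multiindex_levels(ds: xr.Dataset) -> list:
    """List the variables that would become columns, skipping multi-index levels."""
    level_names = set()
    for index in ds.indexes.values():
        # Collect the level names of every pandas MultiIndex attached to ds.
        if isinstance(index, pd.MultiIndex):
            level_names.update(index.names)
    return [
        k for k in ds.variables
        if k not in ds.dims and k not in level_names
    ]


data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
with_dims = {k: ("my_dim", v) for k, v in data.items()}

# "x" and "y" form a multi-index along "my_dim".
multi = xr.Dataset(with_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])
# "x" and "y" are plain (non-index) coordinates.
plain = xr.Dataset(with_dims).set_coords(["x", "y"])

print(columns_excluding_multiindex_levels(multi))
print(columns_excluding_multiindex_levels(plain))
```

With this selection, the multi-indexed dataset would contribute only its data variable as a column, while the dataset without a multi-index would still turn its `x` and `y` coordinates into columns.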