Dataset.to_dataframe() converts multi-index levels both to pandas MultiIndex and to columns #10538

@dylanhmorris

Overview

When to_dataframe() is called on an xarray Dataset that has a multi-index along a given dimension, the coordinates that make up the index levels are translated both:

  • into levels of a pandas MultiIndex for the dataframe
  • into individual columns of the dataframe.

Is this expected and intended behavior?

Main reprex

import numpy as np
import pandas as pd
import xarray as xr

data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_dict_w_dims = {k: ("my_dim", v) for k, v in data_dict.items()}

# create a dataset multi-indexed along "my_dim" by "x" and "y" 
xr_dat = xr.Dataset(data_dict_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])

print(xr_dat)
# <xarray.Dataset> Size: 140B
# Dimensions:  (my_dim: 5)
# Coordinates:
#   * my_dim   (my_dim) object 40B MultiIndex
#   * x        (my_dim) int64 40B 1 2 1 2 1
#   * y        (my_dim) <U1 20B 'a' 'a' 'b' 'b' 'b'
# Data variables:
#     z        (my_dim) int64 40B 5 10 15 20 25

print(xr_dat.to_dataframe()) # x and y present both as columns and as multi-index
#       z  x  y
# x y
# 1 a   5  1  a
# 2 a  10  2  a
# 1 b  15  1  b
# 2 b  20  2  b
# 1 b  25  1  b

Cause

I believe the key line is here in the _to_dataframe() internal method:

xarray/xarray/core/dataset.py

Lines 7092 to 7095 in 699d895

def _to_dataframe(self, ordered_dims: Mapping[Any, int]):
    from xarray.core.extension_array import PandasExtensionArray
    columns_in_order = [k for k in self.variables if k not in self.dims]

The constituent IndexVariable objects of the multi-index are present in self.variables (and not in self.dims), so they become columns:

"x" in xr_dat.dims
# False
"x" in xr_dat.variables
# True
xr_dat.variables["x"]
# <xarray.IndexVariable 'my_dim' (my_dim: 5)> Size: 40B
# [5 values with dtype=int64]
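
To make the effect of that filter concrete, here is a minimal sketch that applies the same list comprehension from outside the method to the reprex dataset. It only mirrors the quoted line; it is not the full _to_dataframe() logic, and the names `data`, `ds` and `candidate_columns` are mine.

```python
import xarray as xr

data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])

# Same dataset as in the main reprex: multi-indexed along "my_dim" by "x" and "y"
ds = (
    xr.Dataset({k: ("my_dim", v) for k, v in data.items()})
    .set_coords(["x", "y"])
    .set_xindex(["x", "y"])
)

# Same filter as the quoted line: every variable that is not a dimension name
candidate_columns = [k for k in ds.variables if k not in ds.dims]
print(candidate_columns)
```

The level coordinates "x" and "y" pass the filter alongside "z", because only "my_dim" is a dimension name.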

This has consequences for pandas -> xarray -> pandas conversion

Because of this, converting a MultiIndex-ed pandas dataframe to an xarray Dataset via the xr.Dataset() constructor and then converting back to pandas via .to_dataframe() will not give back the original dataframe.

Reprex

# create a multi-indexed pandas dataframe
pd_df = pd.DataFrame(data_dict).set_index(["x", "y"])

print(pd_df) # multi-indexed-df with one column
#       z
# x y
# 1 a   5
# 2 a  10
# 1 b  15
# 2 b  20
# 1 b  25

# Conversion to xarray is as expected:
xr_from_pd = xr.Dataset(pd_df)
print(xr_from_pd)
# <xarray.Dataset> Size: 160B
# Dimensions:  (dim_0: 5)
# Coordinates:
#   * dim_0    (dim_0) object 40B MultiIndex
#   * x        (dim_0) int64 40B 1 2 1 2 1
#   * y        (dim_0) object 40B 'a' 'a' 'b' 'b' 'b'
# Data variables:
#     z        (dim_0) int64 40B 5 10 15 20 25

# Converting back to pandas df via `to_dataframe()` yields a df multi-indexed by 
# x and y that also contains `x` and `y` as columns:

print(xr_from_pd.to_dataframe()) # x and y as multi-index and as columns
#      x  y   z
# x y
# 1 a  1  a   5
# 2 a  2  a  10
# 1 b  1  b  15
# 2 b  2  b  20
# 1 b  1  b  25
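
Until the intended behavior is settled, a possible user-side workaround is to drop any column whose name matches an index level. A minimal sketch follows; `drop_columns_duplicating_index` is a hypothetical helper, not an xarray or pandas function, and the duplicated frame is built with pandas only (`set_index(..., drop=False)`) as a stand-in for the output shown above.

```python
import pandas as pd


def drop_columns_duplicating_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose names match an index level."""
    duplicated = [
        name
        for name in df.index.names
        if name is not None and name in df.columns
    ]
    return df.drop(columns=duplicated)


data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])

# The dataframe we started from: multi-indexed by x and y, one column z
original = pd.DataFrame(data_dict).set_index(["x", "y"])

# Stand-in for the round-tripped frame: x and y as index levels AND as columns
duplicated_df = pd.DataFrame(data_dict).set_index(["x", "y"], drop=False)

restored = drop_columns_duplicating_index(duplicated_df)
print(restored.equals(original))
```

In the round trip above this would be called as drop_columns_duplicating_index(xr_from_pd.to_dataframe()). Note that it would also drop a genuine data column that happens to share a name with an index level.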

Thoughts

  • If this behavior is not intended, the flagged line in _to_dataframe() should be changed to determine column names in a way that ignores IndexVariables that form part of a multi-index.
  • Simply filtering down to data variables may not be the right fix, because one might still want coordinates to become columns when they are not going to be part of the pandas MultiIndex, e.g.
# similar dataset with x and y as coordinates but not as a multi-index
dat_no_multiindex = xr.Dataset(
    data_dict_w_dims
).set_coords(["x", "y"])

# potentially intended behavior?
print(dat_no_multiindex.to_dataframe())
#        x  y   z
# my_dim
# 0       1  a   5
# 1       2  a  10
# 2       1  b  15
# 3       2  b  20
# 4       1  b  25
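
One way the column selection described in the two points above could look, as a rough sketch rather than a proposed patch: skip only those variables that are backed by a PandasMultiIndex, and keep every other non-dimension variable. `columns_excluding_multiindex_levels` is a hypothetical helper name; it relies on the public `Dataset.xindexes` mapping and on `xarray.indexes.PandasMultiIndex`.

```python
import xarray as xr
from xarray.indexes import PandasMultiIndex


def columns_excluding_multiindex_levels(ds: xr.Dataset) -> list:
    """Non-dimension variables of ds, minus coordinates that are multi-index levels."""
    # Names of all coordinates backed by a PandasMultiIndex (levels and the dimension itself)
    multiindex_names = {
        name for name, index in ds.xindexes.items() if isinstance(index, PandasMultiIndex)
    }
    return [
        k for k in ds.variables if k not in ds.dims and k not in multiindex_names
    ]


data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_w_dims = {k: ("my_dim", v) for k, v in data.items()}

# x and y are levels of a multi-index along "my_dim"
with_multiindex = xr.Dataset(data_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])

# x and y are plain (non-index) coordinates
without_multiindex = xr.Dataset(data_w_dims).set_coords(["x", "y"])

print(columns_excluding_multiindex_levels(with_multiindex))
print(columns_excluding_multiindex_levels(without_multiindex))
```

With this selection, only "z" would become a column in the multi-indexed case, while "x" and "y" would still become columns when they are ordinary coordinates.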
