Mean/Std of multi-columns+rows #9197

iamgp · 2024-05-15T16:28:05Z

iamgp
May 15, 2024

Hi,
I have a table that contains columns such as quantity_1, quantity_2, quantity_3, and row_type.

So say I have the following rows

10,11,12, "type1"
20,21,22, "type2"
30,31,32, "type1"

I'd like to calculate the mean and stdev of all rows AND columns where row_type="type1".

So I can group and aggregate like the following:
t.group_by("row_type").aggregate(std=t.quantity_1.std())

..but I really want to do
t.group_by("row_type").aggregate(std=[t.quantity_1, t.quantity_2, t.quantity_3].std())

so that it calculates the stdev between 10,11,12,30,31,32 for type1, and just 20,21,22 for type2.

Is this possible? Thanks

Answered by cpcloud


lostmygithubaccount · 2024-05-15T17:05:08Z

lostmygithubaccount
May 15, 2024
Maintainer

I'm pretty confident there's a way, but I'm struggling to find it. You probably need to create a new column holding an array of all the values and aggregate over that. I'll drop this here for now (note that it computes the std per column, not across columns):

[ins] In [1]: import ibis

[ins] In [2]: import ibis.selectors as s

[ins] In [3]: ibis.options.interactive = True
         ...: ibis.options.repr.interactive.max_rows = 10
         ...: ibis.options.repr.interactive.max_columns = None

[ins] In [4]: data = {"row_type": ["type1", "type2", "type1"], "quantity_1": [10, 20, 30], "quantity_2": [11, 21, 31], "quantity_3": [12, 22, 32]}

[ins] In [5]: t = ibis.memtable(data)

[ins] In [6]: t.group_by("row_type").agg(s.across(s.contains("quantity"), ibis._.std()))
Out[6]:
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ row_type ┃ quantity_1 ┃ quantity_2 ┃ quantity_3 ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ string   │ float64    │ float64    │ float64    │
├──────────┼────────────┼────────────┼────────────┤
│ type1    │  14.142136 │  14.142136 │  14.142136 │
│ type2    │       NULL │       NULL │       NULL │
└──────────┴────────────┴────────────┴────────────┘

Someone more experienced can probably chime in with a solution at some point (though much of the team is at PyCon this week).
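
For context on the output above: the numbers look like per-column sample standard deviations within each group, which is not the pooled figure the question asks for. A minimal check with Python's standard `statistics` module (not Ibis) reproduces them, assuming the sample (n - 1) definition is what produced the table:

```python
import statistics

# quantity_1 values for the two type1 rows in the example data
type1_quantity_1 = [10, 30]

# Sample standard deviation (n - 1 denominator) of a single column
per_column_std = statistics.stdev(type1_quantity_1)
print(round(per_column_std, 6))  # 14.142136

# type2 has only one row, so each column holds a single value and the
# sample standard deviation is undefined (shown as NULL in the table)
try:
    statistics.stdev([20])
    single_value_defined = True
except statistics.StatisticsError:
    single_value_defined = False
print(single_value_defined)  # False
```

The same arithmetic applies to quantity_2 (11, 31) and quantity_3 (12, 32), since each pair is also 20 apart.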


cpcloud · 2024-05-28T14:25:35Z

cpcloud
May 28, 2024
Maintainer

You can do this with group_by, but you need to pivot the data into a longer form first:

In [60]: from ibis.interactive import *

In [61]: data
Out[61]:
{'row_type': ['type1', 'type2', 'type1'],
 'quantity_1': [10, 20, 30],
 'quantity_2': [11, 21, 31],
 'quantity_3': [12, 22, 32]}

In [62]: t = ibis.memtable(data)

In [63]: t
Out[63]:
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ row_type ┃ quantity_1 ┃ quantity_2 ┃ quantity_3 ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ string   │ int64      │ int64      │ int64      │
├──────────┼────────────┼────────────┼────────────┤
│ type1    │         10 │         11 │         12 │
│ type2    │         20 │         21 │         22 │
│ type1    │         30 │         31 │         32 │
└──────────┴────────────┴────────────┴────────────┘

In [64]: t.pivot_longer(s.contains("quantity"))
Out[64]:
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓
┃ row_type ┃ name       ┃ value ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩
│ string   │ string     │ int64 │
├──────────┼────────────┼───────┤
│ type1    │ quantity_1 │    10 │
│ type1    │ quantity_2 │    11 │
│ type1    │ quantity_3 │    12 │
│ type2    │ quantity_1 │    20 │
│ type2    │ quantity_2 │    21 │
│ type2    │ quantity_3 │    22 │
│ type1    │ quantity_1 │    30 │
│ type1    │ quantity_2 │    31 │
│ type1    │ quantity_3 │    32 │
└──────────┴────────────┴───────┘

In [65]: t.pivot_longer(s.contains("quantity")).group_by("row_type").agg(std=_.value.std())
Out[65]:
┏━━━━━━━━━━┳━━━━━━━━━━━┓
┃ row_type ┃ std       ┃
┡━━━━━━━━━━╇━━━━━━━━━━━┩
│ string   │ float64   │
├──────────┼───────────┤
│ type1    │ 10.990905 │
│ type2    │  1.000000 │
└──────────┴───────────┘
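
The original question also asked for the mean. As a rough cross-check of what pivot-then-aggregate computes, here is a plain-Python sketch (standard library only, not Ibis) that reshapes the same data into long form and pools the values per `row_type`. The helper name `pooled_stats` and its parameters are made up for illustration. In Ibis, adding `mean=_.value.mean()` to the `agg` call above should give the mean alongside the std, though that was not run in this thread.

```python
import statistics
from collections import defaultdict

data = {
    "row_type": ["type1", "type2", "type1"],
    "quantity_1": [10, 20, 30],
    "quantity_2": [11, 21, 31],
    "quantity_3": [12, 22, 32],
}

def pooled_stats(data, group_col="row_type", pattern="quantity"):
    """Hypothetical helper: pool all matching columns per group."""
    # "Pivot longer": collect every matching value under its row's group key
    values_by_group = defaultdict(list)
    for name, column in data.items():
        if pattern not in name:
            continue
        for group, value in zip(data[group_col], column):
            values_by_group[group].append(value)
    # Aggregate the pooled values; stdev uses the sample (n - 1) definition
    return {
        group: (statistics.mean(vals), statistics.stdev(vals))
        for group, vals in values_by_group.items()
    }

stats = pooled_stats(data)
for group in sorted(stats):
    mean, std = stats[group]
    print(group, mean, round(std, 6))
# type1 21 10.990905
# type2 21 1.0
```

The std values agree with the Ibis output above. Both groups happen to share a mean of 21 in this example data.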
