Automatically Describe DFs Going Into Data Explorer #24

emeeks · 2018-08-02T17:54:48Z

df.describe(include="all") should run and be included as metadata for any dataframe that's being sent to the Data Explorer component.
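For reference, a minimal sketch of what that call produces and how it might be turned into per-column metadata. The example frame and the `metadata` name are made up for illustration, and dropping the NaN entries is an assumption about what a frontend would want, not something this issue settles.

```python
import pandas as pd

# Hypothetical example frame: one numeric column, one text column.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0],
    "city": ["Oslo", "Lima", "Oslo"],
})

# include="all" summarizes every column; statistics that do not apply
# to a column (e.g. "mean" for text) come back as NaN.
summary = df.describe(include="all")

# One dict per column, with the non-applicable (NaN) statistics dropped
# so the result is easier to serialize as JSON metadata.
metadata = {
    name: summary[name].dropna().to_dict()
    for name in summary.columns
}

print(metadata["price"])
print(metadata["city"])
```

Numeric columns end up with count, mean, std, min, the quartiles and max, while text columns end up with count, unique, top and freq.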


rgbkrk · 2018-08-02T19:01:41Z

Fantastic idea!

I think we can include it as part of the fields object (🎩 tip to @emeeks for suggesting this in person). It could come out like this:

We need to change two things:

The Table Schema Spec
Pandas

@alexandercbooth and I prototyped a version that captures the summary statistics just now and this is the code we came up with for pandas:

diff --git a/pandas/io/json/table_schema.py b/pandas/io/json/table_schema.py
index 2dc176648..0460868c1 100644
--- a/pandas/io/json/table_schema.py
+++ b/pandas/io/json/table_schema.py
@@ -113,6 +113,10 @@ def convert_pandas_type_to_json_field(arr, dtype=None):
             field['tz'] = arr.dt.tz.zone
         else:
             field['tz'] = arr.tz.zone
+
+    # TODO: get this to be part of the spec for https://frictionlessdata.io/specs/table-schema/
+    if hasattr(arr, 'describe'):
+        field['summary'] = arr.describe(include="all").to_dict()
     return field

Admittedly, I don't know what the performance implications are. 😬 Perhaps this is fine if it's already being serialized.
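For anyone who wants to try the idea without patching pandas, here is a rough sketch that adds the same kind of `summary` entry after the schema is built. `schema_with_summaries` is a hypothetical helper, not a pandas API, and the `summary` key is the one from the prototype diff above; it is not part of the Table Schema spec.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def schema_with_summaries(df):
    """Build a Table Schema dict and attach describe() output per field.

    Hypothetical helper: "summary" is not a key defined by the spec.
    """
    # index=False keeps the fields list limited to the frame's columns.
    schema = build_table_schema(df, index=False)
    for field in schema["fields"]:
        column = df[field["name"]]
        # Series.describe() picks statistics that fit the column's dtype.
        field["summary"] = column.describe().to_dict()
    return schema


# Hypothetical example frame.
df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [1, 2, 3]})
schema = schema_with_summaries(df)
for field in schema["fields"]:
    print(field["name"], field["type"], field["summary"])
```

Because `describe()` runs once per column here, wrapping the call in a timer on a large frame would be one way to get a feel for the performance cost mentioned above.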

Notebook that uses this and will be useful for debugging: https://gist.github.com/rgbkrk/e1b477641128213db71efa34cfdbb8a7

cc @TomAugspurger

@alexandercbooth wants to take on bringing this into pandas.

stale · 2019-08-05T16:19:09Z

This issue hasn't had any activity on it in the last 90 days. Unfortunately we don't get around to dealing with every issue that is opened. Instead of leaving issues open we're seeking to be transparent by closing issues that aren't being prioritized. If no other activity happens on this issue in one week, it will be closed.
It's more than likely that just by me posting about this, the maintainers will take a closer look at these long forgotten issues to help evaluate what to do next.
If you would like to see this issue get prioritized over others, there are multiple avenues 🗓:

Ask how you can help with this issue 👩🏿‍💻👨🏻‍💻
Help solve other issues the team is currently working on 👨🏾‍💻👩🏼‍💻
Donate to nteract so we can support developers to work on these features and bugs more regularly 💰🕐

Thank you!

hydrosquall · 2019-08-24T16:43:34Z

A related project for ideas around what sorts of summary statistics could be piped into the table:

https://github.com/pandas-profiling/pandas-profiling

captainsafia · 2019-09-30T01:20:25Z

For Hacktoberfest 2019 participants: resolving this issue will require changes across the pandas and nteract repos.

@rgbkrk's comment above is a great place to start on the changes required on the Python side, which is where this modification needs to begin.

captainsafia transferred this issue from nteract/nteract May 6, 2020

willingc added the "enhancement" (New feature or request) label on Jun 4, 2021
