Skip to content

WIP:DataScan refactor to expose data and statistics consistently #94

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 102 commits into from
May 26, 2025

Conversation

tylerriccio33
Copy link
Contributor

@tylerriccio33 tylerriccio33 commented Mar 16, 2025

This is going to be a big one... I'm working to expose the statistics logic in DataScan in a more object oriented way. I'm planning to do 3 major things:

  • Create a more object oriented interface to the column (and data) profiles with defined attributes. This way the IDEs can help us find things instead of blindly remembering a key in some dict, this will also help the static type checker find and fix things we might not immediately realize.
  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute things to narwhals if things are genuinely missing.
  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. If we use dataframes we do 2 things - expand access to datascan for other libraries w/o series APIs and take advantage of parallelism. For example, polars can parallelize the stats ops and even share subplans (if we introduce lazy-ism), which will dramatically improve performance.

This is all a draft as of now I just wanted to clue you in.

Also, I don't plan on changing any tests and the interface will remain untouched!

@rich-iannone
Copy link
Member

@tylerriccio33 this is great, thanks for working on this key piece! Could you try merging with main and see if there are any conflicts? I did some recent work within this file (merging #93) just two days ago. All of that work didn't touch the DataScan class (check https://github.com/posit-dev/pointblank/pull/93/files) but it might be good to merge just in case since there was even a previous merged PR (#88).

Also, is it okay if I release a new version of the package while you're working on this? Given that https://posit-dev.github.io/pointblank/reference/col_summary_tbl.html#pointblank.col_summary_tbl is at a good place (and it's in the docs), it makes sense for a new release.

@tylerriccio33
Copy link
Contributor Author

I will push what I have right now for reference but it'll come with a ton of conflicts. I'm about 90% of the way there, I just have to unify the html rendering interfaces for each of the columns. To be clear I'm not changing functionality just unifying the way the comparison data is exposed so i can build on it for Compare.

So, i'll push what i have now but I still have maybe a day or two to go until I think it's in a good spot. Sorry for the giant pr

@tylerriccio33
Copy link
Contributor Author

Sorry this is like a comically large refactor, we might need to meet in person again to figure some stuff out

@rich-iannone
Copy link
Member

This is great! Don’t worry (I’m not worried) and keep going :)

Send me an email anytime if you want to discuss anything through Zoom (we can schedule easily that way).

@tylerriccio33
Copy link
Contributor Author

Don’t mean to hold you hostage on this refactor, it’s virtually all complete, I just need to reroute all the html logic and add comprehensive test cases. Been busy last few nights, tonight I should have this done and we’d be off to the races.

@tylerriccio33
Copy link
Contributor Author

  • Create a more object oriented interface to the column (and data) profiles with defined attributes. This way the IDEs can help us find things instead of blindly remembering a key in some dict, this will also help the static type checker find and fix things we might not immediately realize.

This is there, I think it's a start and can be improved in the future. More bugs will be found and ways to optimize the interface once we add new statistics.

  • Unify the ibis/narwhals interface to use basically the same statistics. I want to avoid branching and duplicated logic if possible. Maybe I can contribute things to narwhals if things are genuinely missing.

There's no more branching, the function takes something Narwhals compliant and returns as such.

  • Use a dataframe to do everything instead of pulling out series and calculating stats iteratively. If we use dataframes we do 2 things - expand access to datascan for other libraries w/o series APIs and take advantage of parallelism. For example, polars can parallelize the stats ops and even share subplans (if we introduce lazy-ism), which will dramatically improve performance.

This was easy, thankfully, but I need to improve the test coverage as it relates to lazy and eager narwhals compliant frames. The code can cover lazy frames somewhat gracefully but it isn't battle tested.

Currently, the big thing I'm still reconnecting is the HTML generation. The pr does generate the table, but it's absent a few of the amazing formatting features you had. I'm pretty close, I just need to reconnect and reapply the logic.

@rich-iannone
Copy link
Member

@tylerriccio33 Take your time, you got this! Honestly, no rush as I'm not touching DataScan-related things for a while anyway.

@tylerriccio33
Copy link
Contributor Author

Do you think the datetime (and date) formatting could be handled by fmt_datetime from gt? Perhaps if there was a compaction option, or something interesting we could do w/the formatting arguments? It would be nice to leverage the html logic that already exists than to custom build it.

https://posit-dev.github.io/great-tables/reference/GT.fmt_datetime.html

@tylerriccio33
Copy link
Contributor Author

Screenshot 2025-05-18 at 9 15 54 PM

So 2 things fixed:

  • I wrapped the datetimes/dates in your html (I think) and they look better but not perfect. I could maybe just keep all the dates one one line, and very slightly increase the column width? The time section of the datetime being split up from the date proper does look pretty nice though.
  • I add a little footnote on the SL.

Let me know what you think!

@rich-iannone
Copy link
Member

This is great! I will look at this closely tonight/tomorrow. (And I really do like the simple footnote compared to the tiny S.L. text.)

@rich-iannone
Copy link
Member

@tylerriccio33 This looks really good to me. As far as styling goes, we could always make little fixes along the way after this is merged. I think the last thing for you to do is to handle the conflicts, and change any tests in test_datascan.py so things are passing.

Not sure why CI isn't running here (I guess has to do with the conflicts?) but we'll probably understand why sooner or later.

Also, I accidentally performed a merge commit that was pushed here (thought it was being pushed to a fork). I reverted that, but I hope it didn't do any harm to the PR. Apologies for that, and please let me know if it affected anything here.

@tylerriccio33
Copy link
Contributor Author

I have no idea why these would fail, they're not failing on my machine and it's code i haven't touched. Probably some dependency problem, I'll look into it

@tylerriccio33
Copy link
Contributor Author

@rich-iannone I cannot for the life of me reproduce this... The if you look at the 2 tests failing are failing on this ("date_time", "timestamp(6)") assertion. I'm syncing the ci test run using uv, so confident there aren't some pip weirdness going on... Any thoughts?

@tylerriccio33
Copy link
Contributor Author

Fixed it, realized main had an xfail. How I wasn't able to reproduce that I have no clue. My solution was to do a procedural check to see what the schema type would be. Something still feels wrong, but it feels non deterministic...

@tylerriccio33
Copy link
Contributor Author

Also this would close #168 . I added uv to the test runner and dep installer in my quest to debug why the ci env could be different from my local dev.

@tylerriccio33
Copy link
Contributor Author

tylerriccio33 commented May 24, 2025

I think this is looking good, still a lot of little TODOs I left myself, which i'll have to tackle eventually :)

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the huge amount of work you put into this!

@rich-iannone rich-iannone merged commit 9a784ef into posit-dev:main May 26, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants