feat(RFC): Adds altair.datasets #3631

Draft · wants to merge 239 commits into `main`
Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Oct 4, 2024

Related

Tracking

- Waiting on the next vega-datasets release.
  - Once there is a stable `datapackage.json` available, quite a lot of `tools/datasets` can be simplified/removed.
- Discovered a bug that makes some handling of expressions a little less efficient.
- Upstreaming some `nw.Schema` stuff to narwhals.
- Improve the user-facing interface.

Description

Provides a minimal but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__
```python
from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame
```

```python
load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame
```

```python
load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object
```

```python
load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')
```

```python
load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']
```
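The `Loader.from_backend` examples above illustrate a backend-dispatch pattern: one callable interface, with the concrete reader chosen by name. A minimal, stdlib-only sketch of that idea (all names here, such as `MiniLoader` and `_READERS`, are hypothetical and not Altair's actual internals):

```python
# Hypothetical sketch of backend dispatch, in the spirit of
# Loader.from_backend above; the real altair.datasets differs.
from typing import Any, Callable, Dict


class MiniLoader:
    """Dispatches dataset reads to a named backend reader."""

    _READERS: Dict[str, Callable[[str], Any]] = {}

    def __init__(self, backend: str) -> None:
        if backend not in self._READERS:
            raise ValueError(f"Unknown backend: {backend!r}")
        self._backend = backend

    @classmethod
    def register(cls, name: str, reader: Callable[[str], Any]) -> None:
        # Each backend is just a callable from dataset name to a table-like object.
        cls._READERS[name] = reader

    @classmethod
    def from_backend(cls, backend: str) -> "MiniLoader":
        return cls(backend)

    def __call__(self, name: str) -> Any:
        # Delegate the read to the registered backend.
        return self._READERS[self._backend](name)

    def __repr__(self) -> str:
        return f"MiniLoader[{self._backend}]"


# Register a trivial "backend" that just echoes the request.
MiniLoader.register("echo", lambda name: {"dataset": name})

load = MiniLoader.from_backend("echo")
print(repr(load))    # MiniLoader[echo]
print(load("cars"))  # {'dataset': 'cars'}
```

In the real API, the registered readers would be the polars, pandas, `pandas[pyarrow]`, and pyarrow read functions, which is why `type(cars)` differs per backend while the call site stays identical.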

Not required for these requests, but may be helpful to avoid rate limits.
As an example, for comparing against the most recent releases, I've added the 5 most recent.

- Basic mechanism for discovering new versions.
- Tries to minimise the number and total size of requests.
- Experimenting with querying the url cache w/ expressions:
  - `metadata_full.parquet` stores **all known** file metadata (roughly 3000 rows).
  - Single release: **9kb** vs 46 releases: **21kb**.
  - `GitHub.refresh()` to maintain integrity in a safe manner.
  - Still undecided exactly how this functionality should work.
  - Need to resolve the `npm` tags != `gh` tags issue as well.
  - Will be even more useful after merging vega/vega-datasets#663.
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`:
  - All the info is available and it is quicker than manually searching the headings in a browser.
dangotbanned added a commit that referenced this pull request Feb 5, 2025
@mattijn
Contributor

mattijn commented Feb 7, 2025

I’ve added an item to the tracking list in OP. I raised this suggestion during the review as an important improvement to the user interface. I’d be grateful if it could be addressed before merging and is not overlooked. Almost there! 🙌

@dangotbanned
Member Author

dangotbanned commented Feb 7, 2025

> I’ve added an item to the tracking list in OP. I raised this suggestion during the review as an important improvement to the user interface. I’d be grateful if it could be addressed before merging and is not overlooked. Almost there! 🙌

Thanks @mattijn

I actually have that one bookmarked, rest assured I have not forgotten!


This was what I was thinking of doing, while we wait on narwhals-dev/narwhals#1924 and vega/vega-datasets#654:

FYI, I have tried responding to (#3631 (comment)) twice but both times I ended up writing waaaaay too much.
I've still got those in-progress responses - but hopefully I'll be able to trim it down into something more digestible

Note

(to self) stashed an experiment w/ this under the name "datasets-wip-211"

@mattijn
Contributor

mattijn commented Feb 7, 2025

Thanks! Appreciate you having this on your list. Great that we’re aligned.

@dangotbanned
Member Author

> Thanks! Appreciate you having this on your list. Great that we’re aligned.

No worries @mattijn, sorry for not communicating any of this sooner.
Thanks for taking the time to go through this PR.

dangotbanned added a commit that referenced this pull request Feb 10, 2025