feat(RFC): Adds altair.datasets #3631

Draft · wants to merge 239 commits into `main`
Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Oct 4, 2024

Related

Tracking

- Waiting on the next vega-datasets release.
  - Once there is a stable `datapackage.json` available, quite a lot of `tools/datasets` can be simplified/removed.
- Discovered a bug that makes some handling of expressions a little less efficient.
- Upstreaming some `nw.Schema` stuff to narwhals.
- Improve the user-facing interface.

Description

Provides a minimal but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__
```python
from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame
```

```python
load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame
```

```python
load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object
```

```python
load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')
```

```python
load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']
```
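The `Loader.from_backend` examples above illustrate a backend-dispatch pattern: one callable interface, with the concrete reader chosen by name. A minimal, stdlib-only sketch of that idea (all names here, such as `MiniLoader` and `_READERS`, are hypothetical and not Altair's actual internals):

```python
# Hypothetical sketch of backend dispatch, in the spirit of
# Loader.from_backend above; the real altair.datasets differs.
from typing import Any, Callable, Dict


class MiniLoader:
    """Dispatches dataset reads to a named backend reader."""

    _READERS: Dict[str, Callable[[str], Any]] = {}

    def __init__(self, backend: str) -> None:
        if backend not in self._READERS:
            raise ValueError(f"Unknown backend: {backend!r}")
        self._backend = backend

    @classmethod
    def register(cls, name: str, reader: Callable[[str], Any]) -> None:
        # Each backend is just a callable from dataset name to a table-like object.
        cls._READERS[name] = reader

    @classmethod
    def from_backend(cls, backend: str) -> "MiniLoader":
        return cls(backend)

    def __call__(self, name: str) -> Any:
        # Delegate the read to the registered backend.
        return self._READERS[self._backend](name)

    def __repr__(self) -> str:
        return f"MiniLoader[{self._backend}]"


# Register a trivial "backend" that just echoes the request.
MiniLoader.register("echo", lambda name: {"dataset": name})

load = MiniLoader.from_backend("echo")
print(repr(load))    # MiniLoader[echo]
print(load("cars"))  # {'dataset': 'cars'}
```

In the real API, the registered readers would be the polars, pandas, `pandas[pyarrow]`, and pyarrow read functions, which is why `type(cars)` differs per backend while the call site stays identical.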

Not required for these requests, but may be helpful to avoid rate limits.
As an example, for comparing against the most recent releases, I've added the 5 most recent.

- Basic mechanism for discovering new versions.
- Tries to minimise the number and total size of requests.
- Experimenting with querying the url cache w/ expressions:
  - `metadata_full.parquet` stores **all known** file metadata (roughly 3000 rows).
  - Single release: **9kb** vs 46 releases: **21kb**.
  - `GitHub.refresh()` to maintain integrity in a safe manner.
  - Still undecided exactly how this functionality should work.
  - Need to resolve the `npm` tags != `gh` tags issue as well.
  - Will be even more useful after merging vega/vega-datasets#663.
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`:
  - All the info is available and it is quicker than manually searching the headings in a browser.
dangotbanned added a commit that referenced this pull request Feb 5, 2025
@mattijn
Contributor

mattijn commented Feb 7, 2025

I’ve added an item to the tracking list in OP. I raised this suggestion during the review as an important improvement to the user interface. I’d be grateful if it could be addressed before merging and is not overlooked. Almost there! 🙌

@dangotbanned
Member Author

dangotbanned commented Feb 7, 2025

> I’ve added an item to the tracking list in OP. I raised this suggestion during the review as an important improvement to the user interface. I’d be grateful if it could be addressed before merging and is not overlooked. Almost there! 🙌

Thanks @mattijn

I actually have that one bookmarked, rest assured I have not forgotten!


This was what I was thinking of doing, while we wait on narwhals-dev/narwhals#1924 and vega/vega-datasets#654:

FYI, I have tried responding to (#3631 (comment)) twice but both times I ended up writing waaaaay too much.
I've still got those in-progress responses - but hopefully I'll be able to trim it down into something more digestible

Note

(to self) stashed an experiment w/ this under the name "datasets-wip-211"

@mattijn
Contributor

mattijn commented Feb 7, 2025

Thanks! Appreciate you having this on your list. Great that we’re aligned.

@dangotbanned
Member Author

> Thanks! Appreciate you having this on your list. Great that we’re aligned.

No worries @mattijn, sorry for not communicating any of this sooner.
Thanks for taking the time to go through this PR.

dangotbanned added a commit that referenced this pull request Feb 10, 2025