
DOC Replace the MovieLens dataset with the fraud dataset #1053

Merged

Conversation

@Vincent-Maladiere (Member) commented Sep 3, 2024

Addresses #1046

This PR proposes replacing the MovieLens dataset used in the AggJoiner example with the fraud dataset from Figshare. It highlights the performance gained when using AggJoiner for a modeling problem involving tables with one-to-many relationships. The previous problem of self-join was somewhat niche and not representative of the majority of the AggJoiner use cases.

It also discusses the current limitations of both Joiner and AggJoiner when it comes to deploying models.
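For readers unfamiliar with the transformer, here is a minimal sketch of the one-to-many pattern the new example is built around (the table and column names are illustrative, not the ones from the fraud dataset, and the parameter names follow the skrub API as of around this release):

```python
import pandas as pd
from skrub import AggJoiner

# One row per basket in the main table, many product rows per basket
# in the auxiliary table: a one-to-many relationship.
baskets = pd.DataFrame({"basket_ID": [1, 2], "fraud_flag": [0, 1]})
products = pd.DataFrame(
    {"basket_ID": [1, 1, 2, 2, 2], "price": [10.0, 5.0, 3.0, 8.0, 2.0]}
)

# Aggregate the product rows per basket, then join the result onto baskets.
agg_joiner = AggJoiner(
    aux_table=products,
    key="basket_ID",            # join key present in both tables
    cols=["price"],             # auxiliary columns to aggregate
    operations=["mean", "max"],
)
out = agg_joiner.fit_transform(baskets)
```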

@glemaitre glemaitre changed the title [DOC] Replace the MovieLens dataset with the fraud dataset DOC Replace the MovieLens dataset with the fraud dataset Sep 3, 2024
@jeromedockes (Member) left a comment

thanks @Vincent-Maladiere , this is a really cool example! it showcases several important features while offering a more realistic scenario than the previous one.

the drawback of such a rich example is that it is very complex and thus can be a bit scary for users. Indeed it touches on several fine and somewhat orthogonal points:

  • the representation of the input data, normalized vs sparse/concatenated
  • using a business metric and the tuned threshold (which in turn triggers the introduction of the metadata routing)
  • appropriate encoding of the different columns (some need target encoding, others minhash encoding)
  • what minhash really does, in order to understand why the appropriate aggregation function is min() (see the sketch below)
  • how to use the aggjoiner

all of these are definitely interesting and at the center of skrub's focus, but it seems like a lot.
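To illustrate the minhash point, a small sketch (assuming skrub's single-column MinHashEncoder API; the number of components and the strings are made up): each string is encoded as a vector of minimum hash values over its n-grams, so the element-wise min over several encoded strings is the MinHash of the union of their n-gram sets.

```python
import pandas as pd
from skrub import MinHashEncoder

# Each string maps to n_components minimum hash values over its n-grams.
enc = MinHashEncoder(n_components=4)
hashes = enc.fit_transform(pd.Series(["laptop", "mouse"], name="item"))

# The element-wise min over the rows is the MinHash of the union of the
# two strings' n-gram sets: aggregating a basket's products with "min"
# therefore yields a valid MinHash for the whole basket.
union_hash = hashes.min(axis=0)
```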

I wonder if we should have a simple version, which starts from the 2-table dataset (i.e. just the last section) and uses the AUC while stating why it is suboptimal, and a complete version in an examples "advanced" section. WDYT?

[Ten review threads on examples/08_join_aggregation.py (resolved)]
@Vincent-Maladiere (Member, Author):

Thanks for the in-depth review @jeromedockes.

To answer your main concern: there is indeed a tradeoff between using an example that is not 100% trivial, which supports the case for a more realistic example gallery, and keeping the example short and simple. In our IRL conversation, @glemaitre seemed to lean towards something more realistic. I agree that this could be simplified further, though. We need to decide where to put the cursor.

@Vincent-Maladiere (Member, Author):

But I agree with you that we could add another, simpler example for the AggJoiner while using this one to build a more realistic use case.

What do you all think?

@GaelVaroquaux (Member) commented Sep 5, 2024 via email

@jeromedockes jeromedockes added this to the 0.4.0 milestone Sep 9, 2024
@Vincent-Maladiere (Member, Author):

I simplified the example and put the previous long one in the FIXME folder. We can adapt it for an "advanced" example section later.

@Vincent-Maladiere (Member, Author):

I need to add a few tests for the new dataset fetching function.

Also, the notebook takes 2min30 to run on the CI (the randomized search CV being the culprit). When I remove it, the model's performance gets worryingly close to the dummy model's. Should I instead hardcode the good hyper-parameters in the model, with a comment saying we found them using RandomizedSearchCV?
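If we went the hardcoding route, it could look like the sketch below (the estimator and all the values are placeholders, not the actual search results from the example):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyper-parameters found offline with RandomizedSearchCV (placeholder
# values shown here); hardcoding them keeps the CI runtime low.
model = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_leaf_nodes=31,
    max_iter=200,
)
```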

@jeromedockes (Member):

maybe we can keep the hyperparam search and shave some time off another example, such as the "multiple key join" one :)

feature engineering in a simple way using |AggJoiner| and |AggTarget|.
Note that our performance is higher than the baseline of using the mean
rating per movies.
The data consists of two distinct entities: e-commerce "baskets", and "products".
Review comment (Member):

maybe repeat here "one" basket to "many" products for readers who aren't familiar with the one-to-many expression, or even provide an example in the first sentence

@Vincent-Maladiere (Member, Author) commented Sep 13, 2024

I need to convert the parquet dataset into CSV on Figshare to remove the "pyarrow not installed" error in the CI (unless we want to add pyarrow as a mandatory dependency?)

@jeromedockes (Member):

I've seen this error before, but I don't understand it: even if pyarrow is an optional dependency, it is supposed to be listed in the environment of the doc build...

@jeromedockes (Member):

look for some keywords to nerd-snipe @glemaitre such as #pixi #ci

@Vincent-Maladiere (Member, Author):

Yes, but pyarrow is not listed in the test build, hence the error, I guess (notice that I added code outside of the doc/ dir to define fetch_credit_fraud).

I'm concerned about how we fetch datasets: because pyarrow is only an optional dependency, 100% of users will first run into an error when they want to play with the datasets from Figshare. So we should either make pyarrow a mandatory dependency or convert all the datasets on Figshare to CSV (in another PR).

WDYT?
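In the meantime, a friendlier failure mode could be a guarded import in the fetching code; a minimal sketch (the helper name and the message wording are made up):

```python
def _check_parquet_support():
    """Raise an informative error when no parquet engine is available."""
    try:
        import pyarrow  # noqa: F401  # engine used by pandas.read_parquet
    except ImportError as e:
        raise ImportError(
            "Fetching this dataset requires reading parquet files: "
            "please install pyarrow (e.g. `pip install pyarrow`)."
        ) from e
```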

@jeromedockes (Member):

ah right, sorry, I missed the fact that this isn't for the example but for the test. I think making pyarrow a mandatory dependency might scare off some potential users: it used to be a very heavy one, with lots of dependencies of its own. I believe it has improved in that respect, but I would still give it a little time.

maybe we could have something like: you must have either polars or pandas+pyarrow

but in the meanwhile, I think your suggestion of storing the datasets as CSV is a good one. Some of them, such as the airplanes one, may need to be subsampled.
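The conversion itself would be a one-off script along these lines (the file names and sample size are placeholders):

```python
import pandas as pd

# Hypothetical one-off conversion of a Figshare parquet file to CSV,
# subsampling the larger datasets to keep the download small.
df = pd.read_parquet("flights.parquet")
df.sample(n=100_000, random_state=0).to_csv("flights.csv", index=False)
```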

@jeromedockes (Member):

But maybe someone who was there when the first datasets were uploaded, such as @GaelVaroquaux or @jovan-stojanovic, would have a better idea or more insight on this choice.

@Vincent-Maladiere (Member, Author) commented Sep 13, 2024

Although I updated the pixi lock and added the optional dependencies to the test build, I still have the missing pyarrow error 🤔
@glemaitre WDYT?

doc/conf.py Outdated
@@ -482,6 +482,7 @@ def notebook_modification_function(notebook_content, notebook_filename):
"ColumnTransformer": "sklearn.compose.ColumnTransformer",
"OneHotEncoder": "sklearn.preprocessing.OneHotEncoder",
"Pipeline": "sklearn.pipeline.Pipeline",
"make_pipeline": "sklearn.pipeline.make_pipeline",
Review comment (Member):

I would stop adding those. In the RST file, you can directly link with

:func:`~sklearn.pipeline.make_pipeline`

With the ~, it will only display make_pipeline. If you remove the ~, it will show sklearn.pipeline.make_pipeline.

I think we should remove this whole list and use normal Sphinx commands.

@Vincent-Maladiere (Member, Author):

OK great. I can remove this line and open an issue to clean this up in another PR, WDYT?

[Review thread on pyproject.toml (resolved)]
@glemaitre (Member):

> Although I updated the pixi lock and added the optional dependencies to the test build, I still have the missing pyarrow error 🤔
> @glemaitre WDYT?

I think there is something wrong with the test CI: it also tests the documentation. So when we use the min-dependencies environment, which lacks the optional dependencies, running the examples that require them fails.

So we should not be testing the documentation as part of the "test" builds, but only in the CI dedicated to the documentation, which relies on the doc environment that also contains the optional dependencies.

@Vincent-Maladiere (Member, Author):

@jeromedockes Finally, I kept the files as parquet; we can address that issue later.

@TheooJ (Contributor) left a comment

I have a couple of comments (mainly on typos and layout). I think the example is a bit complicated, and if I were a user trying to understand how it works "quickly", I could be a bit discouraged seeing this. I was wondering if we should use a fully numeric dataset for the AggJoiner example (I'm on the lookout for such a dataset ATM) and use MinHash for the fraud one. Let me know what you think.

[Review threads on skrub/datasets/_fetching.py and examples/08_join_aggregation.py (resolved)]
# on the basket dataframe, still on the basket ID.
#
# AggJoiner can help us achieve exactly this. We need to pass the product dataframe as
# an auxiliary table argument to AggJoiner in ``__init__``. ``aux_key`` represents both
@TheooJ (Contributor) review comment:

I'm not sure I'd mention ``__init__``; do we expect people to know this? Completely fine with leaving it if we do.
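For context, the constructor call under discussion looks roughly like this sketch (the names are illustrative; ``main_key``/``aux_key`` are the separate-key variant of the single ``key`` argument):

```python
import pandas as pd
from skrub import AggJoiner

products = pd.DataFrame(
    {"basket_ID": [1, 1, 2], "price": [10.0, 5.0, 3.0]}
)

# The auxiliary table and the join keys are passed when constructing AggJoiner.
agg_joiner = AggJoiner(
    aux_table=products,
    main_key="basket_ID",   # key column in the main (basket) table
    aux_key="basket_ID",    # key column in the auxiliary (product) table
    cols=["price"],
    operations=["sum"],
)
```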

[Review thread on examples/08_join_aggregation.py (resolved)]
from skrub import AggJoiner
from skrub import _selectors as s

minhash_cols_query = s.glob("item*") | s.glob("model*")
@TheooJ (Contributor) review comment:

Maybe we should search for the columns by hand, in case people are not familiar with selectors.

Reply (Member):

that's true, but on the other hand it would add a bunch of annoying / distracting boilerplate. Maybe we can add a comment saying what it does.

it will get nicer once AggJoiner accepts a selector directly for ``cols=``; this will remove the need to do the ``s.select`` ourselves.

@Vincent-Maladiere (Member, Author):

+1 for a clear comment
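For reference, the by-hand version TheooJ suggests would be something like this sketch (the column names are the illustrative item*/model* ones from the snippet above):

```python
import pandas as pd

products = pd.DataFrame(
    columns=["basket_ID", "item_name", "model_id", "price"]
)

# Equivalent to s.glob("item*") | s.glob("model*"), written by hand:
minhash_cols = [
    c for c in products.columns if c.startswith(("item", "model"))
]
```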

# For the second AggJoiner, we use the mean, standard deviation, minimum and maximum
# operations to extract a representative summary of each distribution.
#
# DropCols is another skrub transformer which removes the "ID" column, which doesn't
@TheooJ (Contributor) commented Sep 19, 2024:

I would link DropCols, plus some of the other transformers above such as MinHashEncoder.
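Putting the pieces from the quoted snippet together, the second AggJoiner plus DropCols would sit in a pipeline roughly like this (a sketch; the estimator and the column names are illustrative):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import AggJoiner, DropCols

products = pd.DataFrame(
    {"basket_ID": [1, 1, 2], "price": [10.0, 5.0, 3.0]}
)

pipeline = make_pipeline(
    # Summarize each basket's price distribution with several statistics.
    AggJoiner(
        aux_table=products,
        key="basket_ID",
        cols=["price"],
        operations=["mean", "std", "min", "max"],
    ),
    # The ID column carries no predictive signal once the join is done.
    DropCols(["basket_ID"]),
    HistGradientBoostingClassifier(),
)
```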

@jeromedockes (Member):

> I have a couple of comments (mainly on typos and layout). I think the example is a bit complicated, and if I were a user trying to understand how it works "quickly", I could be a bit discouraged seeing this.

I see what you mean, but it's also nice to have an interesting / semi-realistic example. For users who just want to quickly check how to initialize a joiner on toy data, there are the examples in the reference docstrings (which may need to be expanded).

I would say the example as it is strikes a nice balance. Maybe we could give users who don't care about the "naive" approach a way to skip it: either a <details> block, or a sentence like "if you're not interested in seeing the wrong way and just want to jump to the right solution with the AggJoiner, skip this section by jumping here".

> I was wondering if we should use a fully numeric dataset for the AggJoiner example (I'm on the lookout for such a dataset ATM) and use MinHash for the fraud one. Let me know what you think.

finding new datasets is definitely super useful 💯, but I wouldn't block this PR waiting for another dataset. The example is significantly nicer than the one on the main branch, so I would be in favor of merging once the smaller concerns are addressed, and opening a new PR later if needed (perhaps when the "full" version moves out of FIXME).

Co-authored-by: Théo Jolivet <[email protected]>
Co-authored-by: Jérôme Dockès <[email protected]>
@TheooJ (Contributor) commented Sep 30, 2024

I agree it's much nicer than the example on the main branch; let's merge it now that the smaller changes are addressed.

@Vincent-Maladiere (Member, Author) commented Sep 30, 2024

OK, TODO before merging:

  • A comment on the selector
  • Linking entities like DropCols
  • Add a paragraph to skip the naive approach
  • Update pixi.lock

@Vincent-Maladiere (Member, Author):

I'm eager to try adding some diagrams to make the example more telling

@Vincent-Maladiere (Member, Author):

WDYT?

Data schema

[Screenshot: data schema diagram, 2024-09-30]

AggJoiner

[Screenshot: AggJoiner diagram, 2024-09-30]

@jeromedockes (Member):

I like the idea of having a schema! Actually, we may want to have it in the reference doc or user guide as well, and to have similar ones for some other joiners.

I have some small feedback about the diagram itself, but it might be easier to give if we can wave and point at stuff. Maybe we can talk about it on a video call or IRL?

@TheooJ (Contributor) commented Sep 30, 2024

I also like the idea! +1 for having it somewhere in the user guide or on a specific page.

For some inspiration: https://pandas.pydata.org/docs/user_guide/reshaping.html

@Vincent-Maladiere (Member, Author):

Update:

Data schema

[Screenshot: updated data schema diagram, 2024-10-01]

AggJoiner

[Screenshot: updated AggJoiner diagram, 2024-10-01]

@jeromedockes (Member):

> Update

nice! LGTM

@jeromedockes (Member):

great, merging! very nice addition :)

@jeromedockes jeromedockes merged commit b4e1e77 into skrub-data:main Oct 2, 2024
22 checks passed