
DOC Replace the MovieLens dataset with the fraud dataset #1053

Merged

Conversation

@Vincent-Maladiere (Member) commented Sep 3, 2024

Addresses #1046

This PR proposes replacing the MovieLens dataset used in the AggJoiner example with the fraud dataset from Figshare. It highlights the performance gained when using AggJoiner for a modeling problem involving tables with one-to-many relationships. The previous problem of self-join was somewhat niche and not representative of the majority of the AggJoiner use cases.

It also discusses the current limitations of both Joiner and AggJoiner when it comes to deploying models.
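For readers unfamiliar with the transformer, here is a minimal sketch of the one-to-many pattern the new example is built around (the table and column names are illustrative, not the ones from the fraud dataset, and the parameter names follow the skrub API as of around this release):

```python
import pandas as pd
from skrub import AggJoiner

# One row per basket in the main table, many product rows per basket
# in the auxiliary table: a one-to-many relationship.
baskets = pd.DataFrame({"basket_ID": [1, 2], "fraud_flag": [0, 1]})
products = pd.DataFrame(
    {"basket_ID": [1, 1, 2, 2, 2], "price": [10.0, 5.0, 3.0, 8.0, 2.0]}
)

# Aggregate the product rows per basket, then join the result onto baskets.
agg_joiner = AggJoiner(
    aux_table=products,
    key="basket_ID",            # join key present in both tables
    cols=["price"],             # auxiliary columns to aggregate
    operations=["mean", "max"],
)
out = agg_joiner.fit_transform(baskets)
```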

@glemaitre glemaitre changed the title [DOC] Replace the MovieLens dataset with the fraud dataset DOC Replace the MovieLens dataset with the fraud dataset Sep 3, 2024
@jeromedockes (Member) left a comment

thanks @Vincent-Maladiere , this is a really cool example! it showcases several important features while offering a more realistic scenario than the previous one.

the drawback of such a rich example is that it is very complex and thus can be a bit scary for users. Indeed it touches on several fine and somewhat orthogonal points:

  • the representation of the input data, normalized vs sparse/concatenated
  • using a business metric and the tuned threshold (which in turn triggers the introduction of the metadata routing)
  • appropriate encoding of the different columns (some need target encoding, others minhash encoding)
  • what minhash really does, in order to understand why the appropriate aggregation function is min() (see the sketch below)
  • how to use the aggjoiner

all of these are definitely interesting and at the center of skrub's focus, but it seems like a lot.
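To illustrate the minhash point, a small sketch (assuming skrub's single-column MinHashEncoder API; the number of components and the strings are made up): each string is encoded as a vector of minimum hash values over its n-grams, so the element-wise min over several encoded strings is the MinHash of the union of their n-gram sets.

```python
import pandas as pd
from skrub import MinHashEncoder

# Each string maps to n_components minimum hash values over its n-grams.
enc = MinHashEncoder(n_components=4)
hashes = enc.fit_transform(pd.Series(["laptop", "mouse"], name="item"))

# The element-wise min over the rows is the MinHash of the union of the
# two strings' n-gram sets: aggregating a basket's products with "min"
# therefore yields a valid MinHash for the whole basket.
union_hash = hashes.min(axis=0)
```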

I wonder if we should have a simple version, which starts from the 2-table dataset (i.e. just the last section) and uses the AUC while stating why it is suboptimal, and a complete version in an examples "advanced" section. WDYT?

[Ten review threads on examples/08_join_aggregation.py (resolved)]
@Vincent-Maladiere (Member, Author):

Thanks for the in-depth review @jeromedockes.

To answer your main concern: there is indeed a tradeoff between using an example that is not 100% trivial, which supports the case for a more realistic example gallery, and keeping the example short and simple. In our IRL conversation, @glemaitre seemed to lean towards something more realistic. I agree that this could be simplified further, though. We need to decide where to put the cursor.

@Vincent-Maladiere (Member, Author):

But I agree with you that we could add another, simpler example for the AggJoiner while using this one to build a more realistic use case.

What do you all think?

@GaelVaroquaux (Member) commented Sep 5, 2024 via email

@jeromedockes jeromedockes added this to the 0.4.0 milestone Sep 9, 2024
@Vincent-Maladiere (Member, Author):

I simplified the example and put the previous long one in the FIXME folder. We can adapt it for an "advanced" example section later.

@Vincent-Maladiere (Member, Author):

I need to add a few tests for the new dataset fetching function.

Also, the notebook takes 2min30 to run on the CI (the randomized search CV being the culprit). When I remove it, the model's performance gets worryingly close to the dummy model's. Should I instead hardcode the good hyper-parameters in the model, with a comment saying we found them using RandomizedSearchCV?
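If we went the hardcoding route, it could look like the sketch below (the estimator and all the values are placeholders, not the actual search results from the example):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyper-parameters found offline with RandomizedSearchCV (placeholder
# values shown here); hardcoding them keeps the CI runtime low.
model = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_leaf_nodes=31,
    max_iter=200,
)
```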

@jeromedockes (Member):

maybe we can keep the hyperparam search and shave some time off another example, such as the "multiple key join" one :)

feature engineering in a simple way using |AggJoiner| and |AggTarget|.
Note that our performance is higher than the baseline of using the mean
rating per movies.
The data consists of two distinct entities: e-commerce "baskets", and "products".
Review comment (Member):

maybe repeat here "one" basket to "many" products for readers who aren't familiar with the one-to-many expression, or even provide an example in the first sentence

@Vincent-Maladiere (Member, Author) commented Sep 13, 2024

I need to convert the parquet dataset into CSV on Figshare to remove the "pyarrow not installed" error in the CI (unless we want to add pyarrow as a mandatory dependency?)

@jeromedockes (Member):

I've seen this error before, but I don't understand it: even if pyarrow is an optional dependency, it is supposed to be listed in the environment of the doc build...

@jeromedockes (Member):

look for some keywords to nerd-snipe @glemaitre such as #pixi #ci

@Vincent-Maladiere (Member, Author):

Yes, but pyarrow is not listed in the test build, hence the error, I guess (notice that I added code outside of the doc/ dir to define fetch_credit_fraud).

I'm concerned about how we fetch datasets: because pyarrow is only an optional dependency, 100% of users will first run into an error when they want to play with the datasets from Figshare. So we should either make pyarrow a mandatory dependency or convert all the datasets on Figshare to CSV (in another PR).

WDYT?
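In the meantime, a friendlier failure mode could be a guarded import in the fetching code; a minimal sketch (the helper name and the message wording are made up):

```python
def _check_parquet_support():
    """Raise an informative error when no parquet engine is available."""
    try:
        import pyarrow  # noqa: F401  # engine used by pandas.read_parquet
    except ImportError as e:
        raise ImportError(
            "Fetching this dataset requires reading parquet files: "
            "please install pyarrow (e.g. `pip install pyarrow`)."
        ) from e
```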

@jeromedockes (Member):

ah right, sorry, I missed the fact that this isn't for the example but for the test. I think making pyarrow a mandatory dependency might scare off some potential users: it used to be a very heavy one, with lots of dependencies of its own. I believe it has improved in that respect, but I would still give it a little time.

maybe we could have something like: you must have either polars or pandas+pyarrow

but in the meanwhile, I think your suggestion of storing the datasets as CSV is a good one. Some of them, such as the airplanes one, may need to be subsampled.
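The conversion itself would be a one-off script along these lines (the file names and sample size are placeholders):

```python
import pandas as pd

# Hypothetical one-off conversion of a Figshare parquet file to CSV,
# subsampling the larger datasets to keep the download small.
df = pd.read_parquet("flights.parquet")
df.sample(n=100_000, random_state=0).to_csv("flights.csv", index=False)
```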

@jeromedockes (Member):

But maybe someone who was there when the first datasets were uploaded, such as @GaelVaroquaux or @jovan-stojanovic, would have a better idea or more insight on this choice.

@Vincent-Maladiere (Member, Author) commented Sep 13, 2024

Although I updated the pixi lock and added the optional dependencies to the test build, I still have the missing pyarrow error 🤔
@glemaitre WDYT?

doc/conf.py Outdated
@@ -482,6 +482,7 @@ def notebook_modification_function(notebook_content, notebook_filename):
"ColumnTransformer": "sklearn.compose.ColumnTransformer",
"OneHotEncoder": "sklearn.preprocessing.OneHotEncoder",
"Pipeline": "sklearn.pipeline.Pipeline",
"make_pipeline": "sklearn.pipeline.make_pipeline",
Review comment (Member):

I would stop adding those. In the RST file, you can directly link with

:func:`~sklearn.pipeline.make_pipeline`

With the ~, it will only display make_pipeline. If you remove the ~, it will show sklearn.pipeline.make_pipeline.

I think we should remove this whole list and use normal Sphinx commands.

@Vincent-Maladiere (Member, Author):

OK great. I can remove this line and open an issue to clean this up in another PR, WDYT?

[Review thread on pyproject.toml (resolved)]
@glemaitre (Member):

> Although I updated the pixi lock and added the optional dependencies to the test build, I still have the missing pyarrow error 🤔
> @glemaitre WDYT?

I think there is something wrong with the test CI: it also tests the documentation. So when we use the min-dependencies environment, which lacks the optional dependencies, running the examples that require them fails.

So we should not be testing the documentation as part of the "test" builds, but only in the CI dedicated to the documentation, which relies on the doc environment that also contains the optional dependencies.

@Vincent-Maladiere (Member, Author):

@jeromedockes Finally, I kept the files as parquet; we can address that issue later.

@TheooJ (Contributor) left a comment

I have a couple of comments (mainly on typos and layout). I think the example is a bit complicated, and if I were a user trying to understand how it works "quickly", I could be a bit discouraged seeing this. I was wondering if we should use a fully numeric dataset for the AggJoiner example (I'm on the lookout for such a dataset ATM) and use MinHash for the fraud one. Let me know what you think.

[Review threads on skrub/datasets/_fetching.py and examples/08_join_aggregation.py (resolved)]
# on the basket dataframe, still on the basket ID.
#
# AggJoiner can help us achieve exactly this. We need to pass the product dataframe as
# an auxiliary table argument to AggJoiner in ``__init__``. ``aux_key`` represents both
@TheooJ (Contributor) review comment:

I'm not sure I'd mention ``__init__``; do we expect people to know this? Completely fine with leaving it if we do.
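For context, the constructor call under discussion looks roughly like this sketch (the names are illustrative; ``main_key``/``aux_key`` are the separate-key variant of the single ``key`` argument):

```python
import pandas as pd
from skrub import AggJoiner

products = pd.DataFrame(
    {"basket_ID": [1, 1, 2], "price": [10.0, 5.0, 3.0]}
)

# The auxiliary table and the join keys are passed when constructing AggJoiner.
agg_joiner = AggJoiner(
    aux_table=products,
    main_key="basket_ID",   # key column in the main (basket) table
    aux_key="basket_ID",    # key column in the auxiliary (product) table
    cols=["price"],
    operations=["sum"],
)
```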

[Review thread on examples/08_join_aggregation.py (resolved)]
from skrub import AggJoiner
from skrub import _selectors as s

minhash_cols_query = s.glob("item*") | s.glob("model*")
@TheooJ (Contributor) review comment:

Maybe we should search for the columns by hand, in case people are not familiar with selectors.

Reply (Member):

that's true, but on the other hand it would add a bunch of annoying / distracting boilerplate. Maybe we can add a comment saying what it does.

it will get nicer once AggJoiner accepts a selector directly for ``cols=``; this will remove the need to do the ``s.select`` ourselves.

@Vincent-Maladiere (Member, Author):

+1 for a clear comment
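For reference, the by-hand version TheooJ suggests would be something like this sketch (the column names are the illustrative item*/model* ones from the snippet above):

```python
import pandas as pd

products = pd.DataFrame(
    columns=["basket_ID", "item_name", "model_id", "price"]
)

# Equivalent to s.glob("item*") | s.glob("model*"), written by hand:
minhash_cols = [
    c for c in products.columns if c.startswith(("item", "model"))
]
```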

# For the second AggJoiner, we use the mean, standard deviation, minimum and maximum
# operations to extract a representative summary of each distribution.
#
# DropCols is another skrub transformer which removes the "ID" column, which doesn't
@TheooJ (Contributor) commented Sep 19, 2024:

I would link DropCols, plus some of the other transformers above such as MinHashEncoder.
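Putting the pieces from the quoted snippet together, the second AggJoiner plus DropCols would sit in a pipeline roughly like this (a sketch; the estimator and the column names are illustrative):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import AggJoiner, DropCols

products = pd.DataFrame(
    {"basket_ID": [1, 1, 2], "price": [10.0, 5.0, 3.0]}
)

pipeline = make_pipeline(
    # Summarize each basket's price distribution with several statistics.
    AggJoiner(
        aux_table=products,
        key="basket_ID",
        cols=["price"],
        operations=["mean", "std", "min", "max"],
    ),
    # The ID column carries no predictive signal once the join is done.
    DropCols(["basket_ID"]),
    HistGradientBoostingClassifier(),
)
```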

@jeromedockes (Member):

> I have a couple of comments (mainly on typos and layout). I think the example is a bit complicated, and if I were a user trying to understand how it works "quickly", I could be a bit discouraged seeing this.

I see what you mean, but it's also nice to have an interesting / semi-realistic example. For users who just want to quickly check how to initialize a joiner on toy data, there are the examples in the reference docstrings (which may need to be expanded).

I would say the example as it is strikes a nice balance. Maybe we could give users who don't care about the "naive" approach a way to skip it: either a <details> block, or a sentence like "if you're not interested in seeing the wrong way and just want to jump to the right solution with the AggJoiner, skip this section by jumping here".

> I was wondering if we should use a fully numeric dataset for the AggJoiner example (I'm on the lookout for such a dataset ATM) and use MinHash for the fraud one. Let me know what you think.

finding new datasets is definitely super useful 💯, but I wouldn't block this PR waiting for another dataset. The example is significantly nicer than the one on the main branch, so I would be in favor of merging once the smaller concerns are addressed, and opening a new PR later if needed (perhaps when the "full" version moves out of FIXME).

Co-authored-by: Théo Jolivet <[email protected]>
Co-authored-by: Jérôme Dockès <[email protected]>
@TheooJ (Contributor) commented Sep 30, 2024

I agree it's much nicer than the example on the main branch; let's merge it now that the smaller changes are addressed.

@Vincent-Maladiere (Member, Author) commented Sep 30, 2024

OK, TODO before merging:

  • A comment on the selector
  • Linking entities like DropCols
  • Add a paragraph to skip the naive approach
  • Update pixi.lock

@Vincent-Maladiere (Member, Author):

I'm eager to try adding some diagrams to make the example more telling

@Vincent-Maladiere (Member, Author):

WDYT?

Data schema

[Screenshot: data schema diagram, 2024-09-30]

AggJoiner

[Screenshot: AggJoiner diagram, 2024-09-30]

@jeromedockes (Member):

I like the idea of having a schema! Actually, we may want to have it in the reference doc or user guide as well, and to have similar ones for some other joiners.

I have some small feedback about the diagram itself, but it might be easier to give if we can wave and point at stuff. Maybe we can talk about it on a video call or IRL?

@TheooJ (Contributor) commented Sep 30, 2024

I also like the idea! +1 for having it somewhere in the user guide or on a specific page.

For some inspiration: https://pandas.pydata.org/docs/user_guide/reshaping.html

@Vincent-Maladiere (Member, Author):

Update:

Data schema

[Screenshot: updated data schema diagram, 2024-10-01]

AggJoiner

[Screenshot: updated AggJoiner diagram, 2024-10-01]

@jeromedockes (Member):

> Update

nice! LGTM

@jeromedockes (Member):

great, merging! very nice addition :)

@jeromedockes jeromedockes merged commit b4e1e77 into skrub-data:main Oct 2, 2024
22 checks passed