DOC Replace the MovieLens dataset with the fraud dataset #1053
Conversation
thanks @Vincent-Maladiere, this is a really cool example! it showcases several important features while offering a more realistic scenario than the previous one.
the drawback of such a rich example is that it is very complex and thus can be a bit scary for users. Indeed it touches on several fine and somewhat orthogonal points:
- the representation of the input data, normalized vs sparse/concatenated
- using a business metric and the tuned threshold (which in turn triggers the introduction of the metadata routing)
- appropriate encoding of the different columns (some need target encoding, others min-hashing)
- what minhash really does, in order to understand why the appropriate aggregation function is min() (see the sketch after this comment)
- how to use the AggJoiner
all of these are definitely interesting and at the center of skrub's focus, but it seems like a lot.
I wonder if we should have a simple version, which starts from the 2-table dataset (i.e. just the last section) and uses the AUC while stating why it is suboptimal, and a complete version in an "advanced" examples section. WDYT?
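To illustrate the minhash point above, a toy sketch (the numbers are made up, nothing here is taken from the example):

```python
import numpy as np

# A MinHash signature is a vector of minimum hash values over a set of n-grams.
# The signature of the union of two sets is the element-wise minimum of the two
# signatures, which is why "min" is the natural aggregation when pooling
# per-product MinHash features into a per-basket representation.
sig_product_1 = np.array([3, 7, 2, 9])  # hypothetical signature of product 1
sig_product_2 = np.array([5, 1, 8, 4])  # hypothetical signature of product 2
sig_basket = np.minimum(sig_product_1, sig_product_2)  # -> array([3, 1, 2, 4])
```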
Thanks for the in-depth review @jeromedockes. To answer your main concern: there is indeed a tradeoff between using an example that is not 100% trivial, which supports the case for a more realistic example gallery, and keeping the length and complexity down. In our IRL conversation, @glemaitre seemed to lean towards something more realistic. I agree that this could be simplified further, though; we need to decide where to draw the line.
But I agree with you that we could add another, simpler example for the AggJoiner while using this one to build a more realistic use case. What do you all think?
I think that Jerome's suggestion to have two examples is a good one. It's useful to have an example that is not too scary.
I simplified the example and put the previous long one in the FIXME folder. We can adapt it for an "advanced" example section later.
I need to add a few tests for the new dataset fetching function. Also, the notebook takes 2min30 to run on the CI (the randomized search CV being the culprit). When I remove it, the model's performance gets worryingly close to the dummy model's. Should I hardcode the good hyper-parameters in the model instead, with a comment saying we found them using RandomizedSearchCV?
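For illustration, a minimal sketch of the hardcoding option; the estimator and the values below are placeholders, not the ones from the example:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyper-parameters found offline with RandomizedSearchCV (placeholder values),
# hardcoded here so the CI does not have to re-run the search.
model = HistGradientBoostingClassifier(learning_rate=0.05, max_leaf_nodes=31)
```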
maybe we can keep the hyperparam search and shave some time off another example such as the "multiple key join" one :)
feature engineering in a simple way using |AggJoiner| and |AggTarget|.
Note that our performance is higher than the baseline of using the mean
rating per movies.
The data consists of two distinct entities: e-commerce "baskets", and "products".
maybe repeat here "one" basket to "many" products for readers who aren't familiar with the one-to-many expression, or even provide an example in the first sentence
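To make the "one" basket to "many" products relationship concrete, a toy sketch (the column names are assumptions, not the real dataset's):

```python
import pandas as pd

# One row per basket on the main side, several product rows per basket on the
# auxiliary side, linked through the basket ID.
baskets = pd.DataFrame({"basket_ID": [1, 2], "fraud_flag": [0, 1]})
products = pd.DataFrame(
    {
        "basket_ID": [1, 1, 2, 2, 2],  # basket 2 contains three products
        "item": ["phone", "case", "tv", "cable", "remote"],
        "cash_price": [500, 20, 800, 10, 30],
    }
)
```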
I need to convert the parquet dataset into CSV on Figshare to remove the "pyarrow not installed" error in the CI (unless we want to add pyarrow as a mandatory dependency?)
I saw this error before but I don't understand it, because even though pyarrow is an optional dependency it is supposed to be listed in the env of the doc build...
look for some keywords to nerd-snipe @glemaitre such as #pixi #ci
Yes, but pyarrow is not listed in the test build, hence the mistake, I guess (notice that I added code outside of the …). I'm concerned about how we fetch datasets: because we use pyarrow optionally, 100% of users will first run into an error when they want to play with the datasets from Figshare. So either we make pyarrow a mandatory dep or we convert all datasets on Figshare to CSV (in another PR). WDYT?
ah right, sorry, I missed the fact that this isn't for the example but for the test. I think pyarrow as a mandatory dependency might scare off some potential users, because it used to be a very heavy one with lots of dependencies of its own. I believe it has improved in that respect, but still, I would give it a little time. Maybe we could have something like "you must have either polars or pandas+pyarrow", but in the meantime I think your suggestion of storing the datasets as CSV is a good one. Some of them, such as the airplanes dataset, may need to be subsampled.
But maybe someone who was there when the first datasets were uploaded, such as @GaelVaroquaux or @jovan-stojanovic, would have a better idea or more insight on this choice.
Although I updated the pixi lock and added the optional dependencies to the test build, I still get the missing pyarrow error 🤔
doc/conf.py
@@ -482,6 +482,7 @@ def notebook_modification_function(notebook_content, notebook_filename):
    "ColumnTransformer": "sklearn.compose.ColumnTransformer",
    "OneHotEncoder": "sklearn.preprocessing.OneHotEncoder",
    "Pipeline": "sklearn.pipeline.Pipeline",
    "make_pipeline": "sklearn.pipeline.make_pipeline",
I would stop adding those. In the RST file, you can directly link with :func:`~sklearn.pipeline.make_pipeline`. With the `~`, it will only display `make_pipeline`; if you remove the `~`, it will show `sklearn.pipeline.make_pipeline`.
I think that we should remove the whole list and use the normal Sphinx commands.
Ok great, I can remove this line and open an issue to clean this up in another PR? WDYT?
I think that there is something wrong with the test CI: it also tests the documentation. So when we have the min-dependencies build without the optional dependencies, it fails because we run an example that requires more dependencies. So we should not be testing the documentation as part of the "test" builds, but only in the CI dedicated to the documentation, which relies on the …
@jeromedockes In the end I kept the files as parquet; we can address that issue later.
I have a couple of comments (mainly on typos and layout). I think the example is a bit complicated, and if I were a user trying to understand how it works "quickly", I could be a bit discouraged seeing this. I was wondering if we should use a fully numeric dataset for the AggJoiner example (I'm on the lookout for such a dataset ATM) and use MinHash for the fraud one; let me know what you think.
examples/08_join_aggregation.py
# on the basket dataframe, still on the basket ID.
#
# AggJoiner can help us achieve exactly this. We need to pass the product dataframe as
# an auxiliary table argument to AggJoiner in ``__init__``. ``aux_key`` represent both
I'm not sure I'd mention `__init__`, do we expect people to know this? Completely fine with leaving it if we do.
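For context, a rough sketch of the instantiation being discussed, reusing the toy tables above (column names are assumptions, and parameter names follow recent skrub versions):

```python
from skrub import AggJoiner

agg_joiner = AggJoiner(
    aux_table=products,    # the product dataframe is the auxiliary table
    aux_key="basket_ID",   # join key on the product side (assumed name)
    main_key="basket_ID",  # join key on the basket side (assumed name)
    cols=["cash_price"],   # columns to aggregate before joining (assumed)
    operations=["min"],    # aggregation function applied to those columns
)
baskets_enriched = agg_joiner.fit_transform(baskets)
```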
from skrub import AggJoiner
from skrub import _selectors as s

minhash_cols_query = s.glob("item*") | s.glob("model*")
Maybe we should search for the columns by hand, in case people are not familiar with selectors.
that's true but on the other hand it will add a bunch of annoying / distracting boilerplate. maybe we can add a comment saying what it does.
it will get nicer once the AggJoiner accepts a selector directly for `cols=`; this will remove the need to do the `s.select` ourselves
+1 for a clear comment
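For readers skimming the thread, a sketch of what the selector does next to a hand-rolled equivalent (the column-name convention is an assumption):

```python
from skrub import _selectors as s

# Matches every column whose name starts with "item" or "model".
minhash_cols_query = s.glob("item*") | s.glob("model*")

# Roughly equivalent "by hand" version, without selectors:
minhash_cols = [
    col
    for col in products.columns
    if col.startswith("item") or col.startswith("model")
]
```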
examples/08_join_aggregation.py
# For the second AggJoiner, we use the mean, standard deviation, minimum and maximum
# operations to extract a representative summary of each distribution.
#
# DropCols is another skrub transformer which removes the "ID" column, which doesn't
I would link `DropCols` + some of the other transformers above such as `MinHashEncoder`.
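A rough sketch of that step, again with assumed column names and reusing the toy tables from above:

```python
from sklearn.pipeline import make_pipeline
from skrub import AggJoiner, DropCols

pipeline = make_pipeline(
    AggJoiner(
        aux_table=products,
        aux_key="basket_ID",
        main_key="basket_ID",
        cols=["cash_price"],                       # numeric columns to summarize
        operations=["mean", "std", "min", "max"],  # summary of each distribution
    ),
    DropCols(["basket_ID"]),  # the ID column carries no predictive information
)
```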
I see what you mean, but it's also nice to have an interesting / semi-realistic example. for users who just want to check quickly how to initialize a joiner on toy data there are the examples in the reference docstrings (which may need to be expanded). I would say the example as it is strikes a nice balance. maybe I would give users who don't care about the "naive" approach a way to skip it -- either a …
finding new datasets is definitely super useful 💯 but I wouldn't block this PR waiting for another dataset; the example is significantly nicer than the one in the main branch, so I would be in favor of merging once the smaller concerns are addressed and opening a new PR later if needed (perhaps when the "full" version moves out of FIXME)
Co-authored-by: Théo Jolivet <[email protected]> Co-authored-by: Jérôme Dockès <[email protected]>
I agree it's much nicer than the example on the main branch, let's merge it now that the smaller changes are addressed
Co-authored-by: Théo Jolivet <[email protected]>
Ok, TODO before merging:
I'm eager to try adding some diagrams to make the example more telling.
I like the idea of having a schema! actually we may want to have it in the reference doc or user guide as well, and to have similar ones for some other joiners. I have some small feedback about the diagram itself, but it might be easier if we can wave and point at stuff; maybe we can talk about it on a video call or IRL?
I also like the idea! Also +1 for having it somewhere in the user guide or on a specific page. For some inspiration: https://pandas.pydata.org/docs/user_guide/reshaping.html
nice! LGTM
great, merging! very nice addition :)
Addresses #1046
This PR proposes replacing the MovieLens dataset used in the AggJoiner example with the fraud dataset from Figshare. It highlights the performance gained by using AggJoiner on a modeling problem involving tables with one-to-many relationships. The previous self-join problem was somewhat niche and not representative of most AggJoiner use cases.
It also discusses the current limitations of both Joiner and AggJoiner when it comes to deploying models.