Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update indexing notebook #73

Merged
merged 12 commits into from
Feb 9, 2024
Merged

update indexing notebook #73

merged 12 commits into from
Feb 9, 2024

Conversation

PhilippeMoussalli
Copy link
Contributor

Updated indexing notebook.

For now we're using the version from this PR to allow using resuable components in notebooks.

  • Converted the custom cleaning component and chunking component to lightweight components.
  • Added TODOs to change after having a stable release

Future Todos:

  • There is an error when using the custom cleaning component related to the way the component text is parsed (related to presence of \n. Requires further investigation. (This PR)
  • Change write index components to lightweight component (This PR)
  • Switch over the two remaining notebooks ti lightweight components as well (future PRs)

Copy link
Collaborator

@mrchtr mrchtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at the changes. Imo, we still need to define which components we are transforming into lightweight components and which we keep as reusable.

I would keep the chunking component and the write to Weaviate as reusable. Any arguments for why we should implement these as custom components?

The aggregate_eval_results and text_cleaning seem to be good examples for custom components. I would only convert them.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrchtr!

The not so nice thing about having these components in separate files, is that we still implement them as if they are reusable. We are still wrapping a not-so-nice interface around existing libraries, while I would just "hardcode" the arguments used into the components since they're lightweight. This would simplify the component a lot, but probably works better if the components are inline.

CC: @GeorgesLorre

@@ -1,3 +1,4 @@
fondant==0.9.0
fondant[component,aws,azure,gcp,docker]==0.10.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we might need this in the other use cases as well. Didn't run into them, but might be because I already had the dependencies in my environment. Will try to re-run with a clean one.

requirements.txt Outdated
notebook==7.0.6
weaviate-client==3.25.3
langchain==0.0.329
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this here if it's in the extra_requires of the component?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we do have langchain as extra requirement. But at the moment we are not installing the extra requirements in the local environment. When I am in the notebook, and want to test the class implementation, I will get an exception that the dependency is missing. I think this is something we should takle in #830

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are not doing the same for eg. Ragas?

Since we are not actually executing the components in the notebook at this moment, I would leave them out of the requirements.txt.

@GeorgesLorre
Copy link
Contributor

The not so nice thing about having these components in separate files, is that we still implement them as if they are reusable. We are still wrapping a not-so-nice interface around existing libraries, while I would just "hardcode" the arguments used into the components since they're lightweight. This would simplify the component a lot, but probably works better if the components are inline.

Thx @mrchtr, looking good. But I do tend to agree here that is almost to nicely abstracted now, they are reusable components with a decorator on top now. I don't feel like we get any of the lightweight component benefits. I would really try to dumb them down by defining them inline (in the notebook). Removing optional functionality. Hardcoding most of the arguments and by doing so removing most of the docstrings. This should make them more compact and understandable at a glance.

Let me know if this does not make sense.

requirements.txt Outdated
notebook==7.0.6
weaviate-client==3.25.3
langchain==0.0.329
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are not doing the same for eg. Ragas?

Since we are not actually executing the components in the notebook at this moment, I would leave them out of the requirements.txt.

src/evaluation.ipynb Outdated Show resolved Hide resolved
src/parameter_search.ipynb Outdated Show resolved Hide resolved
@mrchtr
Copy link
Collaborator

mrchtr commented Feb 8, 2024

The not so nice thing about having these components in separate files, is that we still implement them as if they are reusable. We are still wrapping a not-so-nice interface around existing libraries, while I would just "hardcode" the arguments used into the components since they're lightweight. This would simplify the component a lot, but probably works better if the components are inline.

Thx @mrchtr, looking good. But I do tend to agree here that is almost to nicely abstracted now, they are reusable components with a decorator on top now. I don't feel like we get any of the lightweight component benefits. I would really try to dumb them down by defining them inline (in the notebook). Removing optional functionality. Hardcoding most of the arguments and by doing so removing most of the docstrings. This should make them more compact and understandable at a glance.

Let me know if this does not make sense.

Alright, reduced the notebooks to a minimum.
Should we remove the python pipelines completely? There are not needed anymore when we don't support the parameter search.

@RobbeSneyders
Copy link
Member

Should we remove the python pipelines completely? There are not needed anymore when we don't support the parameter search.

I'm a fan of having both in our use cases, the notebooks as explanation, and the python files to actually build on. On the other hand, I see the duplication we'd be introducing.

WDYT @GeorgesLorre?

@mrchtr mrchtr merged commit f6c4e5d into main Feb 9, 2024
1 check passed
@mrchtr mrchtr deleted the bump-to-0.10.0 branch February 9, 2024 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants