add embeddings to TestsetGenerator #1562

Merged · 2 commits · Oct 24, 2024
41 changes: 29 additions & 12 deletions src/ragas/testset/synthesizers/generate.py
@@ -12,6 +12,7 @@
from ragas.executor import Executor
from ragas.llms import BaseRagasLLM, LangchainLLMWrapper
from ragas.run_config import RunConfig
+from ragas.embeddings.base import BaseRagasEmbeddings, LangchainEmbeddingsWrapper
from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.synthesizers import default_query_distribution
from ragas.testset.synthesizers.testset_schema import Testset, TestsetSample
@@ -22,6 +23,7 @@
from langchain_core.callbacks import Callbacks
from langchain_core.documents import Document as LCDocument
from langchain_core.language_models import BaseLanguageModel as LangchainLLM
+from langchain_core.embeddings.embeddings import Embeddings as LangchainEmbeddings

from ragas.embeddings.base import BaseRagasEmbeddings
from ragas.llms.base import BaseRagasLLM
@@ -42,24 +44,32 @@ class TestsetGenerator:
----------
llm : BaseRagasLLM
The language model to use for the generation process.
+embedding_model : BaseRagasEmbeddings
+The embedding model to use for the generation process.
knowledge_graph : KnowledgeGraph, default empty
The knowledge graph to use for the generation process.
"""

llm: BaseRagasLLM
+embedding_model: BaseRagasEmbeddings
Member

Everything looks good, but just one point here: the TestsetGenerator object does not need an embedding model, since it does not use it while synthesizing the testset (embeddings are only needed for transforms).

I love making the llm and embedding model explicit and forcing the user to provide them there - it will help solve a lot of the issues.

Thanks a lot for making this contribution 🙂 ❤️, let me know what you think about the point I made

Contributor Author @hunter-walden2113, Oct 23, 2024

Ok, thanks for the quick review. The reason I added embedding_model as an attribute is that in line 61 I made it so you can pass your embedding model to the from_langchain method, which returns an instance of TestsetGenerator containing both the LLM and the embedding model (line 68). This lets the TestsetGenerator object have an embedding model attributed to it, so if the user does not pass an embedding model to generate_with_langchain_docs via the transforms_embedding_model parameter, the method can fall back to the embedding model attributed to the class instead of throwing an error (see line 108).

I do understand your point though. And we can remove that if you want. Thanks!

Member

I think it is okay for now - merging this in 🙂

knowledge_graph: KnowledgeGraph = field(default_factory=KnowledgeGraph)

@classmethod
def from_langchain(
cls,
llm: LangchainLLM,
+embedding_model: LangchainEmbeddings,
knowledge_graph: t.Optional[KnowledgeGraph] = None,
) -> TestsetGenerator:
"""
Creates a `TestsetGenerator` from a Langchain LLM and embedding model.
"""
knowledge_graph = knowledge_graph or KnowledgeGraph()
-return cls(LangchainLLMWrapper(llm), knowledge_graph)
+return cls(
+    LangchainLLMWrapper(llm),
+    LangchainEmbeddingsWrapper(embedding_model),
+    knowledge_graph
+)

def generate_with_langchain_docs(
self,
@@ -77,19 +87,26 @@ def generate_with_langchain_docs(
"""
Generates an evaluation dataset based on given scenarios and parameters.
"""
-if transforms is None:
-    # use default transforms
-    if transforms_llm is None:
-        transforms_llm = self.llm
-        logger.info("Using TestGenerator.llm for transforms")
-    if transforms_embedding_model is None:
-        raise ValueError(
-            "embedding_model must be provided for default_transforms. Alternatively you can provide your own transforms through the `transforms` parameter."
-        )
+# force the user to provide an llm and embedding client to prevent use of default LLMs
+if not self.llm and not transforms_llm:
+    raise ValueError(
+        '''An llm client was not provided.
+        Provide an LLM on TestsetGenerator instantiation or as an argument for the transforms_llm parameter.
+        Alternatively you can provide your own transforms through the `transforms` parameter.'''
+    )
+if not self.embedding_model and not transforms_embedding_model:
+    raise ValueError(
+        '''An embedding client was not provided.
+        Provide an embedding model on TestsetGenerator instantiation or as an argument for the transforms_embedding_model parameter.
+        Alternatively you can provide your own transforms through the `transforms` parameter.'''
+    )
+
+if not transforms:
 transforms = default_transforms(
-    llm=transforms_llm or self.llm,
-    embedding_model=transforms_embedding_model,
-)
+    llm=transforms_llm or self.llm,
+    embedding_model=transforms_embedding_model or self.embedding_model
+)

# convert the documents to Ragas nodes
nodes = []