[RFC] Testset Generation: making it faster and easy to use #380
Could you also allow it to process in parallel?
Yes @babysor, that will be there. The idea is that if you need, say, 100 dataset examples, each of those 100 items will be created in parallel - either with
jjmachan added a commit that referenced this issue on Jan 8, 2024:

implements #380 (Co-authored-by: Shahules786 <[email protected]>)
jjmachan added a commit that referenced this issue on Jan 18, 2024:
perform an evolution:

```py
from ragas.testset.evolutions import SimpleEvolution, NodeFilter, QuestionFilter, logger

node_filter = NodeFilter(gpt4)
ques_filter = QuestionFilter(gpt4)
se = SimpleEvolution(node_filter, ques_filter)
await se.aevolve(llm, docstore)
```

run evolutions with executor:

```py
from ragas.executor import Executor

exec = Executor(is_async=False)
qs = []
for i in range(10):
    se = SimpleEvolution(node_filter, ques_filter)
    exec.submit(se.evolve, llm, docstore, name=f"SimpleEvolution-{i}")
try:
    qs = exec.results()
except ValueError:
    se = SimpleEvolution(node_filter, ques_filter)
```

generates 300 samples in <6 min, should be scalable enough. Related to #380.
shahules786 pushed a commit that referenced this issue on Jan 23, 2024:
…496) usage:

```py
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator with openai models
generator = TestsetGenerator.with_openai()

# specify distributions
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# generate testset
testset = generator.generate_with_llamaindex_docs(documents, 100, distributions)
testset.to_pandas()
```

100 rows in <4 mins. Part of #380.
finished with the release of v0.1 :)
Awesome, thanks so much!
will close the rest of the related issues too - most have been fixed in the new version
What is this about?
We have had Synthetic Test Data generation in beta for a while, and many of you have given us valuable feedback on it. We are now reworking it to be faster and more extensible for wider use.
Ragas takes a novel approach to evaluation data generation. An ideal evaluation dataset should encompass the various types of questions encountered in production, including questions of varying difficulty levels. LLMs are not good at creating diverse samples by default, as they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas achieves this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. This approach ensures comprehensive coverage of the performance of the various components within your pipeline, resulting in a more robust evaluation process.

Core Components

- `Evolutions` - this is the core; it defines how to evolve the given (context, question) pair into more complex questions, adding more context if needed.
- `TestsetGenerator` - this takes the LLM, evolutions, Documents and other configuration and returns the generated testset. This class is also responsible for scheduling the different runs in parallel for maximum throughput.
- `DocumentStore` and `Document` - `Document` is an extension of langchain_core's Document abstraction. `DocumentStore` is responsible for connecting with the available documents and giving `Evolutions` an interface to fetch documents (adjacent and similar) as needed.
- `Filter` - filters critique the output from the evolutions and decide whether it should be accepted or not. The `Evolution` decides how to evolve the (context, question) pair and the `Filter` checks whether the result is acceptable (see the sketch after this list).
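As a rough sketch of how these components compose, the snippet below mirrors the code from the commits above. It assumes `gpt4`, `llm` and `docstore` are an already-constructed critic LLM, generator LLM and document store; the exact signatures may have changed since this RFC.

```py
from ragas.testset.evolutions import SimpleEvolution, NodeFilter, QuestionFilter

# filters critique what the evolution produces (assumes `gpt4` is a critic LLM)
node_filter = NodeFilter(gpt4)
ques_filter = QuestionFilter(gpt4)

# the evolution turns a (context, question) pair into a harder question,
# pulling extra documents from `docstore` when it needs them
evolution = SimpleEvolution(node_filter, ques_filter)

# run inside an async context or a notebook
question = await evolution.aevolve(llm, docstore)
```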
Usage

High Level
Users can use it by importing the evolutions, defining the distribution of the evolutions in the final testset, and configuring the `TestsetGenerator`.
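For example, the high-level flow from the commit above looks roughly like this. It is a sketch assuming `documents` is a list of llama-index documents; treat the method names as those of the linked commit rather than a stable API.

```py
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator wired up with OpenAI models
generator = TestsetGenerator.with_openai()

# share of the final testset each evolution type should contribute
distributions = {simple: 0.5, multi_context: 0.4, reasoning: 0.1}

# generate 100 samples from the provided documents
testset = generator.generate_with_llamaindex_docs(documents, 100, distributions)
testset.to_pandas()
```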
Your own `Evolution`s and `Filter`s
If you want to create a new `Evolution`, you will have to subclass `BaseEvolution` and create a subclass of `BaseFilter`.
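A custom pair might look something like the sketch below. This is only an illustration: the import paths, the `_aevolve` and `filter` hook names, and their signatures are assumptions, not the actual `BaseEvolution`/`BaseFilter` interface.

```py
# assumed import paths for the base classes
from ragas.testset.evolutions import BaseEvolution, BaseFilter


class ParaphraseEvolution(BaseEvolution):
    """Hypothetical evolution that paraphrases the seed question."""

    async def _aevolve(self, current_tries, current_nodes):
        # hypothetical hook: ask the LLM to rephrase the question built
        # from `current_nodes` and return the evolved question
        ...


class MinLengthFilter(BaseFilter):
    """Hypothetical filter that rejects very short questions."""

    async def filter(self, question: str) -> bool:
        # hypothetical hook: accept only questions with at least five words
        return len(question.split()) >= 5
```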
Document Storage
By default there will be an `InMemoryDocStore`, but you can also connect it to other databases by extending the `BaseDocumentStore` class.
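Connecting another backend could look roughly like the sketch below. The import path and the method names (`add_documents`, `get_similar`, `get_adjacent`) are assumptions chosen to match the "adjacent and similar" lookups described above, not the actual `BaseDocumentStore` interface.

```py
from ragas.testset.docstore import BaseDocumentStore  # assumed import path


class ChromaDocumentStore(BaseDocumentStore):
    """Hypothetical docstore backed by a Chroma collection."""

    def __init__(self, collection):
        self.collection = collection

    def add_documents(self, docs):
        # hypothetical hook: embed and persist the documents
        ...

    def get_similar(self, node, top_k: int = 3):
        # hypothetical hook: return the top_k most similar documents,
        # used by evolutions that need extra context
        ...

    def get_adjacent(self, node):
        # hypothetical hook: return documents adjacent to `node` in the
        # original source, for multi-context style evolutions
        ...
```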
Issues this will fix