[RFC] Testset Generation: making it faster and easy to use #380
Could you also allow it to process in parallel?
Yes @babysor, that will be there. The idea is that if you need, say, 100 dataset examples, each of those 100 items will be created in parallel - either with
jjmachan added a commit that referenced this issue on Jan 8, 2024:

implements #380 (Co-authored-by: Shahules786 <[email protected]>)
jjmachan added a commit that referenced this issue on Jan 18, 2024:
perform an evolution:

```py
from ragas.testset.evolutions import SimpleEvolution, NodeFilter, QuestionFilter, logger

node_filter = NodeFilter(gpt4)
ques_filter = QuestionFilter(gpt4)
se = SimpleEvolution(node_filter, ques_filter)
await se.aevolve(llm, docstore)
```

run evolutions with executor:

```py
from ragas.executor import Executor

exec = Executor(is_async=False)
qs = []
for i in range(10):
    se = SimpleEvolution(node_filter, ques_filter)
    exec.submit(se.evolve, llm, docstore, name=f"SimpleEvolution-{i}")
try:
    qs = exec.results()
except ValueError:
    se = SimpleEvolution(node_filter, ques_filter)
```

generates 300 samples in <6 min, should be scalable enough. Related to #380.
shahules786 pushed a commit that referenced this issue on Jan 23, 2024:
…496) usage:

```py
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator with openai models
generator = TestsetGenerator.with_openai()

# specify distributions
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# generate testset
testset = generator.generate_with_llamaindex_docs(documents, 100, distributions)
testset.to_pandas()
```

100 rows in <4 mins. Part of #380.
finished with the release of v0.1 :)
Awesome, thanks so much!
will close the rest of the related issues too - most have been fixed in the new version
What is this about?
We have had Synthetic Test Data generation in beta for a while, and many of you have given us valuable feedback on it. We are now reworking it to be faster and more extensible for wider use.
Ragas takes a novel approach to evaluation data generation. An ideal evaluation dataset should encompass the various types of questions encountered in production, including questions of varying difficulty levels. LLMs are not good at creating diverse samples by default, as they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas achieves this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents. This approach ensures comprehensive coverage of the performance of the various components within your pipeline, resulting in a more robust evaluation process.

Core Components

- `Evolutions` - this is the core; it defines how to evolve the given (context, question) pair into more complex questions, adding more context if needed.
- `TestsetGenerator` - this takes the LLM, evolutions, Documents and other configuration and returns the generated testset. This class is also responsible for scheduling the different runs in parallel for maximum throughput.
- `DocumentStore` and `Document` - `Document` is an extension of langchain_core's Document abstraction. `DocumentStore` is responsible for connecting with the available documents and giving `Evolutions` an interface to fetch documents (adjacent and similar) as needed.
- `Filter` - filters critique the output from the evolutions and decide whether it should be accepted or not. The `Evolution` decides how to evolve the (context, question) pair and the `Filter` checks whether the result is acceptable (see the sketch after this list).
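As a rough sketch of how these components compose, the snippet below mirrors the code from the commits above. It assumes `gpt4`, `llm` and `docstore` are an already-constructed critic LLM, generator LLM and document store; the exact signatures may have changed since this RFC.

```py
from ragas.testset.evolutions import SimpleEvolution, NodeFilter, QuestionFilter

# filters critique what the evolution produces (assumes `gpt4` is a critic LLM)
node_filter = NodeFilter(gpt4)
ques_filter = QuestionFilter(gpt4)

# the evolution turns a (context, question) pair into a harder question,
# pulling extra documents from `docstore` when it needs them
evolution = SimpleEvolution(node_filter, ques_filter)

# run inside an async context or a notebook
question = await evolution.aevolve(llm, docstore)
```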
Usage

High Level
Users can use it by importing the evolutions, defining the distribution of the evolutions in the final testset, and configuring the `TestsetGenerator`.
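For example, the high-level flow from the commit above looks roughly like this. It is a sketch assuming `documents` is a list of llama-index documents; treat the method names as those of the linked commit rather than a stable API.

```py
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator wired up with OpenAI models
generator = TestsetGenerator.with_openai()

# share of the final testset each evolution type should contribute
distributions = {simple: 0.5, multi_context: 0.4, reasoning: 0.1}

# generate 100 samples from the provided documents
testset = generator.generate_with_llamaindex_docs(documents, 100, distributions)
testset.to_pandas()
```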
Your own `Evolution`s and `Filter`s
If you want to create a new `Evolution`, you will have to subclass `BaseEvolution` and create a subclass of `BaseFilter`.
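A custom pair might look something like the sketch below. This is only an illustration: the import paths, the `_aevolve` and `filter` hook names, and their signatures are assumptions, not the actual `BaseEvolution`/`BaseFilter` interface.

```py
# assumed import paths for the base classes
from ragas.testset.evolutions import BaseEvolution, BaseFilter


class ParaphraseEvolution(BaseEvolution):
    """Hypothetical evolution that paraphrases the seed question."""

    async def _aevolve(self, current_tries, current_nodes):
        # hypothetical hook: ask the LLM to rephrase the question built
        # from `current_nodes` and return the evolved question
        ...


class MinLengthFilter(BaseFilter):
    """Hypothetical filter that rejects very short questions."""

    async def filter(self, question: str) -> bool:
        # hypothetical hook: accept only questions with at least five words
        return len(question.split()) >= 5
```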
Document Storage
By default there will be an `InMemoryDocStore`, but you can also connect it to other databases by extending the `BaseDocumentStore` class.
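Connecting another backend could look roughly like the sketch below. The import path and the method names (`add_documents`, `get_similar`, `get_adjacent`) are assumptions chosen to match the "adjacent and similar" lookups described above, not the actual `BaseDocumentStore` interface.

```py
from ragas.testset.docstore import BaseDocumentStore  # assumed import path


class ChromaDocumentStore(BaseDocumentStore):
    """Hypothetical docstore backed by a Chroma collection."""

    def __init__(self, collection):
        self.collection = collection

    def add_documents(self, docs):
        # hypothetical hook: embed and persist the documents
        ...

    def get_similar(self, node, top_k: int = 3):
        # hypothetical hook: return the top_k most similar documents,
        # used by evolutions that need extra context
        ...

    def get_adjacent(self, node):
        # hypothetical hook: return documents adjacent to `node` in the
        # original source, for multi-context style evolutions
        ...
```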
Issues this will fix