[R-302] Support function-calling / json mode / structured generation for testset generation #1532
Labels: enhancement (New feature or request) · linear (Created by Linear-GitHub Sync) · module-testsetgen (Module testset generation)
Describe the Feature
Most service APIs now support enforcing schema outputs through function calling, json mode, or structured generation.
It would be very useful to have an option that uses the service API to enforce schema constraints, rather than hoping chat prompts follow the expected format.
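For illustration, a minimal sketch of what such an option could produce (the `response_format` shape follows OpenAI's structured-outputs API; the schema, model name, and `build_request` helper are hypothetical placeholders, and the payload is shown as a plain dict so the example stays self-contained):

```python
import json

# Illustrative schema for one generated test sample; in practice this would
# come from the Pydantic models testset generation already uses.
SAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "ground_truth": {"type": "string"},
    },
    "required": ["question", "ground_truth"],
    "additionalProperties": False,
}


def build_request(prompt: str, schema: dict) -> dict:
    """Build a chat-completion payload that asks the service API to enforce
    the schema, instead of hoping the model follows prompt instructions."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        # OpenAI's structured-outputs shape; other providers (e.g. tool use
        # on Anthropic) expose the same idea with different field names.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "testset_sample",
                "schema": schema,
                "strict": True,
            },
        },
    }


payload = build_request("Generate one question from the given chunk.", SAMPLE_SCHEMA)
print(json.dumps(payload["response_format"], indent=2))
```

With this, the API itself guarantees the output parses against the schema, so the downstream Pydantic validation step cannot fail on malformed JSON.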
Why is the feature important for you?
With OpenAI, synthetic generation works flawlessly 99% of the time.
With Anthropic or Llama models, I get frequent parse errors, which trigger retries and ultimately fail. This burns a lot of tokens (and therefore money).
Concretely, generating a testset of 100 questions with gpt-4o-mini uses ~660k input tokens and produces ~13k output tokens. When I attempt to generate a testset from the same knowledge graph with Claude 3.5 Sonnet, generation fails with parse errors, but I still end up using ~850k input and ~22.5k output tokens because of the retries!
Additional context
Given that most of the responses are already parsed with Pydantic, it should be fairly trivial to turn the desired Pydantic object into a JSON schema (hint: openai provides `openai.pydantic_function_tool()` to convert Pydantic models to an OpenAI-compatible subset of JSON schema).
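As a sketch of the hint above: given a JSON schema (which Pydantic's `model_json_schema()` already emits for any model), wrapping it into an OpenAI-compatible function tool is only a few lines. The tool-definition field names below follow OpenAI's tool-calling API; the `pydantic_schema_to_tool` helper and the example schema are hypothetical stand-ins, shown without a `pydantic` dependency to keep the snippet self-contained:

```python
def pydantic_schema_to_tool(name: str, schema: dict, description: str = "") -> dict:
    """Wrap a JSON schema (e.g. from SomeModel.model_json_schema()) into an
    OpenAI-style function tool definition, roughly what
    openai.pydantic_function_tool() produces."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": schema,
            "strict": True,
        },
    }


# Example: the kind of schema Pydantic would emit for a simple QA model.
qa_schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["question", "answer"],
    "additionalProperties": False,
}

tool = pydantic_schema_to_tool("qa_pair", qa_schema, "One generated QA pair")
print(tool["function"]["name"])  # → qa_pair
```

The resulting dict can be passed in the `tools` list of a chat-completion request, so the provider constrains generation to the schema instead of relying on prompt-following.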