The chatbot datasets contain only 10 and 100 queries, respectively, which may limit statistical robustness and generalizability to diverse domains.

The released chatbot subsets are limited to 10 and 100 queries for reproducibility, but our **curated datasets include xxx candidate queries** for LM-Market and CA-Product.

To verify robustness, we **scaled the CA dataset to [TODO: 500 and 1,000] queries** and observed **consistent trends in all key metrics [TODO: as shown in the Table below]**, confirming that the benchmark conclusions hold across larger scales.
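One way to quantify whether trends are consistent across scales is to report a bootstrap confidence interval for each metric at each dataset size. The following is a minimal sketch, not the evaluation code used for the benchmark: the function names `bootstrap_ci` and `ci_by_size` are hypothetical, and the per-query scores are synthetic placeholders rather than real results.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-query score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the queries with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(scores), lo, hi


def ci_by_size(scores, sizes, seed=0):
    """Draw a random subset of each size and compute its bootstrap interval."""
    rng = random.Random(seed)
    return {size: bootstrap_ci(rng.sample(scores, size), seed=seed) for size in sizes}


# Synthetic per-query scores (ASSUMPTION: placeholder data, not benchmark results).
_data_rng = random.Random(42)
synthetic_scores = [_data_rng.random() for _ in range(1000)]

results = ci_by_size(synthetic_scores, sizes=[10, 100, 500, 1000])
for size, (mean, lo, hi) in results.items():
    print(f"n={size:>4}  mean={mean:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

If the intervals at 500 and 1,000 queries overlap the point estimates obtained at 100 queries, that supports the claim that the conclusions are stable; the intervals at 10 queries will typically be much wider.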

Experiment configuration:

- dataset:
    - CA-Product: **only modify the last step and explore the best number of queries**
    - LM-Sys-Market: use an LLM to select the top 1,000 queries that are suitable for ad insertion.
- base LLM: Doubao; judge LLM: 4.1-mini; embedding model: text-embedding-small
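The LM-Sys-Market selection step can be sketched as a score-then-rank procedure. This is only an illustration under stated assumptions: `select_top_k` and `toy_suitability` are hypothetical names, and the keyword heuristic is a stand-in for the LLM suitability judgment, whose prompt and scoring scale are not specified here.

```python
import heapq
from typing import Callable, Sequence


def select_top_k(
    queries: Sequence[str], score_fn: Callable[[str], float], k: int = 1000
) -> list[str]:
    """Return the k highest-scoring queries, best first.

    heapq.nlargest evaluates score_fn once per query, which matters when
    score_fn wraps a paid LLM call.
    """
    return heapq.nlargest(k, queries, key=score_fn)


def toy_suitability(query: str) -> float:
    """ASSUMPTION: keyword stand-in for an LLM-based ad-suitability score."""
    commercial_terms = ("buy", "best", "recommend", "price", "compare")
    q = query.lower()
    return sum(term in q for term in commercial_terms) / len(commercial_terms)


queries = [
    "What is the best laptop to buy for video editing?",
    "Explain the proof of the Pythagorean theorem.",
    "Can you recommend a budget phone and compare prices?",
    "Write a haiku about autumn.",
]

top = select_top_k(queries, toy_suitability, k=2)
for q in top:
    print(q)
```

In practice `score_fn` would wrap a call to the chosen LLM and `k` would be set to 1,000; keeping the scorer as a parameter lets the ranking logic stay the same when the model or prompt changes.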
