While the released chatbot subsets contain 10 and 100 queries for reproducibility, our curated datasets include xxx candidate queries for LM-Market and CA-Product.
To verify robustness, we scaled the CA dataset to [TODO: 500 and 1,000] queries and observed consistent trends in all key metrics [TODO: as shown in the Table below], confirming that the benchmark conclusions hold at larger scales.
Experiment configuration:
- dataset:
  - CA-Product: modify only the last step and explore the best number of queries.
  - LM-Sys-Market: use an LLM to select the top-1000 queries that are suitable for ad insertion.
- base LLM: doubao; judge LLM: 4.1-mini; embedding model: text-embedding-small
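
The LM-Sys-Market selection step can be illustrated with a minimal sketch. It assumes each query receives a numeric suitability score and that the highest-scoring queries are kept. The names `select_top_k`, `score_fn`, and `stub_score` are hypothetical, and the keyword-based scorer below is only a stand-in for the LLM rating, not the prompt or model actually used.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def select_top_k(
    queries: Iterable[str],
    score_fn: Callable[[str], float],
    k: int = 1000,
) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, query) pairs, best first."""
    scored = ((score_fn(q), q) for q in queries)
    # heapq.nlargest avoids fully sorting the candidate pool;
    # ties in score are broken by comparing the query text.
    return heapq.nlargest(k, scored)


def stub_score(query: str) -> float:
    # Hypothetical stand-in for an LLM suitability rating in [0, 1].
    # A real pipeline would call the LLM here instead.
    commercial_terms = ("buy", "best", "recommend", "price")
    hits = sum(term in query.lower() for term in commercial_terms)
    return hits / len(commercial_terms)


example_queries = [
    "What is the best laptop to buy for students?",
    "Explain the proof of the Pythagorean theorem.",
    "Recommend a price range for running shoes.",
]
top = select_top_k(example_queries, stub_score, k=2)
for score, query in top:
    print(f"{score:.2f}  {query}")
```

Passing the scorer as a callable keeps the selection logic independent of the model, so the stub can be replaced by an LLM call without changing `select_top_k`.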