The evaluation relies entirely on LLM-as-judge scoring without any human annotation or preference validation, making it unclear how well the measured metrics align with real user perception. #10

@Qingbolan

Description

Human alignment

  • basic configuration:
    • treatment group: Ad-LLM (GIR-R); control group: Ad-Chat
    • datasets: LM-Market and CA-Product
      • [TODO: how should we sub-sample the query-answer pairs so that human-judge alignment can be measured?]
        • use ChatGPT to filter for questions that (1) are suitable for Chinese students and (2) have short corresponding responses (50+50)
        • translate the queries and responses into Chinese
        • first pass: use GPT-5 to label the responses along the 6 dimensions (10+10); see the labeling sketch after this outline
        • second pass: the three of us verify the labels; third pass: a lab mate verifies them
        • finally, recruit crowdsourced annotators
  • user study procedure:
    • 6 user groups, each covering a single qualitative dimension (accuracy, trust, personality, etc.)
    • for each query:
      • show users the responses from both the control and treatment systems
      • [TODO: shuffle the left/right order of the two responses to avoid position bias; see the randomization sketch after this outline]
      • have users score both responses following the guidance in the corresponding system prompt
  • evaluation procedure:
    • [TODO: how do we measure alignment between the human raters and the LLM judge? score correlation? rank agreement? something else? a metric sketch follows this outline]
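A minimal sketch of the GPT-5 labeling pass, assuming the OpenAI Python SDK and the `gpt-5` model name mentioned above. Only accuracy, trust, and personality are named in this issue, so the other three dimension names below are placeholders.

```python
# Sketch of the first labeling pass. Assumptions: OpenAI Python SDK,
# "gpt-5" as the model name, and a partly hypothetical dimension list
# (only accuracy, trust, and personality are named in this issue).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder names; swap in the study's actual 6 dimensions.
DIMENSIONS = ["accuracy", "trust", "personality", "helpfulness", "fluency", "safety"]

def label_response(query: str, response: str) -> dict[str, int]:
    """Ask the judge model for a 1-5 score on each dimension, returned as JSON."""
    prompt = (
        "Rate the response to the query on each dimension from 1 (worst) to 5 (best).\n"
        f"Dimensions: {', '.join(DIMENSIONS)}\n"
        f"Query: {query}\nResponse: {response}\n"
        'Answer with a JSON object like {"accuracy": 4, ...} and nothing else.'
    )
    completion = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(completion.choices[0].message.content)
```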
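For the left/right shuffle, one option is a deterministic per-(user, query) coin flip so the layout is reproducible and balanced in expectation. A sketch, with all identifiers hypothetical:

```python
# Sketch of deterministic left/right assignment per (user, query) pair.
# All identifiers here are hypothetical.
import hashlib

def display_order(user_id: str, query_id: str,
                  control: str, treatment: str) -> tuple[str, str]:
    """Return (left, right) responses; the coin flip hashes the pair of IDs."""
    digest = hashlib.sha256(f"{user_id}:{query_id}".encode()).digest()
    if digest[0] % 2 == 0:
        return control, treatment
    return treatment, control
```

Logging which side held the treatment response on each trial also lets us test for residual position bias afterwards.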
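For the alignment question, two standard choices are rank correlation on the raw scores (Spearman's rho, Kendall's tau) and chance-corrected agreement on the per-query preference (Cohen's kappa on which response wins). A sketch using SciPy and scikit-learn, with hypothetical input arrays:

```python
# Sketch of human-vs-LLM-judge alignment metrics (SciPy / scikit-learn).
# `human` and `judge` are hypothetical parallel arrays of per-response scores;
# `*_pref` are per-query winner labels (0 = control, 1 = treatment).
import numpy as np
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

def score_alignment(human: np.ndarray, judge: np.ndarray) -> dict[str, float]:
    """Rank correlation between aligned human and judge score arrays."""
    rho, rho_p = spearmanr(human, judge)
    tau, tau_p = kendalltau(human, judge)
    return {"spearman_rho": rho, "spearman_p": rho_p,
            "kendall_tau": tau, "kendall_p": tau_p}

def preference_kappa(human_pref: np.ndarray, judge_pref: np.ndarray) -> float:
    """Cohen's kappa on per-query preference labels."""
    return cohen_kappa_score(human_pref, judge_pref)
```

Correlation answers "do the scores move together?", while kappa answers "does the judge pick the same winner as humans more often than chance?"; reporting both covers the score-level and preference-level views.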
