The evaluation relies entirely on LLM-as-judge scoring without any human annotation or preference validation, making it unclear how well the measured metrics align with real user perception. #10

@Qingbolan

Description

Human alignment

  • basic configuration:
    • treatment group: Ad-LLM (GIR-R); control group: Ad-Chat
    • datasets: LM-Market and CA-Product
      • [TODO: how should we sub-sample the query-answer pairs so that human-judge alignment can be measured?]
        • use ChatGPT to filter for questions that (1) are suitable for Chinese students and (2) have short corresponding responses (50+50)
        • translate the queries and responses into Chinese
        • first pass: use GPT-5 to label the responses along the 6 dimensions (10+10); see the labeling sketch after this outline
        • second pass: the three of us verify the labels; third pass: a lab mate verifies them
        • finally, recruit crowdsourced annotators
  • user study procedure:
    • 6 user groups, each covering a single qualitative dimension (accuracy, trust, personality, etc.)
    • for each query:
      • show users the responses from both the control and treatment systems
      • [TODO: shuffle the left/right order of the two responses to avoid position bias; see the randomization sketch after this outline]
      • have users score both responses following the guidance in the corresponding system prompt
  • evaluation procedure:
    • [TODO: how do we measure alignment between the human raters and the LLM judge? score correlation? rank agreement? something else? a metric sketch follows this outline]
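A minimal sketch of the GPT-5 labeling pass, assuming the OpenAI Python SDK and the `gpt-5` model name mentioned above. Only accuracy, trust, and personality are named in this issue, so the other three dimension names below are placeholders.

```python
# Sketch of the first labeling pass. Assumptions: OpenAI Python SDK,
# "gpt-5" as the model name, and a partly hypothetical dimension list
# (only accuracy, trust, and personality are named in this issue).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder names; swap in the study's actual 6 dimensions.
DIMENSIONS = ["accuracy", "trust", "personality", "helpfulness", "fluency", "safety"]

def label_response(query: str, response: str) -> dict[str, int]:
    """Ask the judge model for a 1-5 score on each dimension, returned as JSON."""
    prompt = (
        "Rate the response to the query on each dimension from 1 (worst) to 5 (best).\n"
        f"Dimensions: {', '.join(DIMENSIONS)}\n"
        f"Query: {query}\nResponse: {response}\n"
        'Answer with a JSON object like {"accuracy": 4, ...} and nothing else.'
    )
    completion = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(completion.choices[0].message.content)
```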
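For the left/right shuffle, one option is a deterministic per-(user, query) coin flip so the layout is reproducible and balanced in expectation. A sketch, with all identifiers hypothetical:

```python
# Sketch of deterministic left/right assignment per (user, query) pair.
# All identifiers here are hypothetical.
import hashlib

def display_order(user_id: str, query_id: str,
                  control: str, treatment: str) -> tuple[str, str]:
    """Return (left, right) responses; the coin flip hashes the pair of IDs."""
    digest = hashlib.sha256(f"{user_id}:{query_id}".encode()).digest()
    if digest[0] % 2 == 0:
        return control, treatment
    return treatment, control
```

Logging which side held the treatment response on each trial also lets us test for residual position bias afterwards.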
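For the alignment question, two standard choices are rank correlation on the raw scores (Spearman's rho, Kendall's tau) and chance-corrected agreement on the per-query preference (Cohen's kappa on which response wins). A sketch using SciPy and scikit-learn, with hypothetical input arrays:

```python
# Sketch of human-vs-LLM-judge alignment metrics (SciPy / scikit-learn).
# `human` and `judge` are hypothetical parallel arrays of per-response scores;
# `*_pref` are per-query winner labels (0 = control, 1 = treatment).
import numpy as np
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

def score_alignment(human: np.ndarray, judge: np.ndarray) -> dict[str, float]:
    """Rank correlation between aligned human and judge score arrays."""
    rho, rho_p = spearmanr(human, judge)
    tau, tau_p = kendalltau(human, judge)
    return {"spearman_rho": rho, "spearman_p": rho_p,
            "kendall_tau": tau, "kendall_p": tau_p}

def preference_kappa(human_pref: np.ndarray, judge_pref: np.ndarray) -> float:
    """Cohen's kappa on per-query preference labels."""
    return cohen_kappa_score(human_pref, judge_pref)
```

Correlation answers "do the scores move together?", while kappa answers "does the judge pick the same winner as humans more often than chance?"; reporting both covers the score-level and preference-level views.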
