type reference
status done
updated_at 2026-04-30
canonical true

Failure Label standard reference

Standard for classifying trial results as failure / success. Introduced by plan: scorer-failure-label-reference (Stage 2B). Python enum: `experiments/schema.py:FailureLabel`. The Stage 2C `experiments/exp_h4_recheck/analyze.py:ErrorMode` is an alias of this enum.

1. Standard enum values

| enum | value (str) | Definition | Auto-classifiable? |
|---|---|---|---|
| `NONE` | `"none"` | Correct answer (acc_v3 ≥ 0.5, or a task-specific criterion) | ✅ (based on acc_v3) |
| `FORMAT_ERROR` | `"format_error"` | JSON parse / schema violation. `final_answer` is not a dict or is malformed | ✅ (part of Stage 2A `TrialError.PARSE_ERROR` + part of the `final_answer` null cases) |
| `WRONG_SYNTHESIS` | `"wrong_synthesis"` | Format is OK but the content is wrong. acc < 0.5 and `final_answer` length > 10 | ✅ (heuristic, Stage 2C `classify_error_mode`) |
| `EVIDENCE_MISS` | `"evidence_miss"` | `evidence_ref` is missing or cites the wrong source. Automatic classification was not implemented when this reference was introduced; manual labeling is recommended | ❌ (to be automated when Critic Tool is introduced) |
| `NULL_ANSWER` | `"null_answer"` | `final_answer` is None / an empty string. The core of the ABC orchestrator's swallow pattern (the Exp09 5-trial dilute incident) | ✅ (direct comparison) |
| `CONNECTION_ERROR` | `"connection_error"` | Model server connection refused / WinError 10061, etc. In sync with Stage 2A `TrialError.CONNECTION_ERROR` | ✅ (Stage 2A `classify_trial_error`) |
| `PARSE_ERROR` | `"parse_error"` | JSON parse fail. In sync with Stage 2A `TrialError.PARSE_ERROR` | ✅ (Stage 2A) |
| `TIMEOUT` | `"timeout"` | ReadTimeout, etc. In sync with Stage 2A `TrialError.TIMEOUT` | ✅ (Stage 2A) |
| `OTHER` | `"other"` | Unclassified (residual cases that are hard to classify) | |
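As a rough illustration of the table above, the enum and the answer-level rules could look like the sketch below. The member names and string values come from the table; everything else is an assumption: the actual `experiments/schema.py` may be structured differently, `classify_answer` is a hypothetical helper (not the Stage 2C `classify_error_mode`), and the order of the checks and the use of `str()` to measure answer length are guesses. Transport-level labels (`CONNECTION_ERROR`, `PARSE_ERROR`, `TIMEOUT`) are not decided here because they come from the run loop.

```python
from enum import Enum
from typing import Any, Optional


class FailureLabel(str, Enum):
    """Sketch of the standard labels; values follow the table above."""

    NONE = "none"
    FORMAT_ERROR = "format_error"
    WRONG_SYNTHESIS = "wrong_synthesis"
    EVIDENCE_MISS = "evidence_miss"  # no automatic classifier yet; label manually
    NULL_ANSWER = "null_answer"
    CONNECTION_ERROR = "connection_error"
    PARSE_ERROR = "parse_error"
    TIMEOUT = "timeout"
    OTHER = "other"


def classify_answer(final_answer: Any, acc_v3: Optional[float]) -> FailureLabel:
    """Hypothetical helper: label one trial from its answer and score.

    Assumptions (not confirmed by the reference): the check order,
    measuring length via str(), and a fixed 0.5 threshold.
    """
    # Direct comparison: missing or empty answer.
    if final_answer is None or final_answer == "":
        return FailureLabel.NULL_ANSWER
    # The answer must be a dict; anything else is a format problem.
    if not isinstance(final_answer, dict):
        return FailureLabel.FORMAT_ERROR
    # Well-formed and scored at or above the threshold: success.
    if acc_v3 is not None and acc_v3 >= 0.5:
        return FailureLabel.NONE
    # Well-formed, scored below the threshold, and non-trivial in length.
    if acc_v3 is not None and len(str(final_answer)) > 10:
        return FailureLabel.WRONG_SYNTHESIS
    return FailureLabel.OTHER
```

Because the enum mixes in `str`, `FailureLabel("timeout")` resolves to `FailureLabel.TIMEOUT`, which is convenient when reading labels back from stored results.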

2. Relationship to Stage 2A TrialError

`experiments/run_helpers.py:TrialError` is the enum the trial loop uses to decide on a fatal abort. `FailureLabel` is the enum the analysis side uses for retrospective classification. Semantic correspondence:

| TrialError (Stage 2A) | FailureLabel (Stage 2B) | Usage scope |
|---|---|---|
| `NONE` | `NONE`, `WRONG_SYNTHESIS`, or `NULL_ANSWER` (branches on acc_v3) | run loop vs analyze |
| `CONNECTION_ERROR` | `CONNECTION_ERROR` | run loop fatal abort vs analyze label |
| `TIMEOUT` | `TIMEOUT` | Same |
| `PARSE_ERROR` | `PARSE_ERROR` | Same |
| `MODEL_ERROR` | `OTHER` | The model response itself is an error (4xx/5xx); labeled `OTHER` at analysis time |
| `OTHER` | `OTHER` | Same |

→ Use TrialError in the run loop and FailureLabel in analysis. Do not merge the two: they serve different scopes.
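The correspondence above can be sketched as a one-way conversion applied at analysis time. This is only an illustration under stated assumptions: both enums are stand-ins defined locally so the block runs on its own, the `TrialError` member values are unknown (hence `auto()`), `to_failure_label` is a hypothetical name, and the exact way the `NONE` row branches (null check first, then the 0.5 threshold) is a guess based on the definitions in section 1.

```python
from enum import Enum, auto
from typing import Any


class TrialError(Enum):
    """Stand-in for the run-loop enum; real member values are not known here."""

    NONE = auto()
    CONNECTION_ERROR = auto()
    TIMEOUT = auto()
    PARSE_ERROR = auto()
    MODEL_ERROR = auto()
    OTHER = auto()


class FailureLabel(str, Enum):
    """Stand-in with only the labels this mapping can produce."""

    NONE = "none"
    WRONG_SYNTHESIS = "wrong_synthesis"
    NULL_ANSWER = "null_answer"
    CONNECTION_ERROR = "connection_error"
    PARSE_ERROR = "parse_error"
    TIMEOUT = "timeout"
    OTHER = "other"


# Rows of the table that map one-to-one, independent of the score.
_DIRECT = {
    TrialError.CONNECTION_ERROR: FailureLabel.CONNECTION_ERROR,
    TrialError.TIMEOUT: FailureLabel.TIMEOUT,
    TrialError.PARSE_ERROR: FailureLabel.PARSE_ERROR,
    TrialError.MODEL_ERROR: FailureLabel.OTHER,  # 4xx/5xx from the model
    TrialError.OTHER: FailureLabel.OTHER,
}


def to_failure_label(err: TrialError, final_answer: Any, acc_v3: float) -> FailureLabel:
    """Hypothetical analysis-side conversion; the run loop keeps using TrialError."""
    if err in _DIRECT:
        return _DIRECT[err]
    # err is TrialError.NONE: the trial completed, so inspect the result.
    if final_answer is None or final_answer == "":
        return FailureLabel.NULL_ANSWER
    if acc_v3 >= 0.5:
        return FailureLabel.NONE
    return FailureLabel.WRONG_SYNTHESIS
```

Keeping the conversion in a separate function, rather than giving the two enums shared members, leaves the run-loop enum free to change without touching stored analysis labels.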

3. Mapping of existing ad-hoc labels

Labels used in result.md / handoff documents written before this reference was introduced (zero retroactive changes):

Exp09 H9c (docs/reference/results/exp-09-longctx.md §"에러 모드", "Error modes")

| Existing expression | Mapping (FailureLabel) |
|---|---|
| format_error (24, solo_dump) | `FORMAT_ERROR` |
| wrong_synthesis (6, rag_baseline / 3, abc_tattoo) | `WRONG_SYNTHESIS` |
| evidence_miss (2, abc_tattoo) | `EVIDENCE_MISS` |

Exp10 v3 ABC JSON parse analysis (docs/reference/exp10-v3-abc-json-fail-diagnosis.md)

| Existing expression | Mapping |
|---|---|
| fence_unclosed (3) | `PARSE_ERROR` (or `FORMAT_ERROR`) |
| empty (1) | `NULL_ANSWER` |
| truncate (0) | (none) |

Exp10 §4.5 reliability (docs/reference/results/exp-10-reproducibility-cost.md §4.5)

| Existing expression | Mapping |
|---|---|
| JSON parse fail (gemma_8loop, 4 cases) | `PARSE_ERROR` |
| null (gemma_1loop, 11 cases) | `NULL_ANSWER` |
| timeout (gemini_flash, 4 cases) | `TIMEOUT` |

Exp09 5-trial drop analysis (docs/reference/exp09-5trial-drop-analysis-2026-04-30.md)

| Existing expression | Mapping |
|---|---|
| WinError 10061 (rag/solo trial 4-5: 20/20 each) | `CONNECTION_ERROR` |
| num_assertions=0, final_answer=null (abc trial 4-5: 20/20) | `NULL_ANSWER` (result of the ABC orchestrator swallow) |

→ All of the above keep the labels assigned at analysis time (zero retroactive changes). New analyses use the standard enum.

4. Guide for writing new analyses / plans

When writing a new analysis helper or result.md:

  1. Import the enum with `from experiments.schema import FailureLabel`
  2. Use enum members such as `FailureLabel.NULL_ANSWER` (avoid using the string literal `"null_answer"` directly)
  3. For column headers in tables / reports, use the enum value (lowercase snake_case)
  4. For items without automatic classification (`EVIDENCE_MISS`), label manually and state this in a disclosure
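The four steps above might look like this in a small reporting helper. It is a sketch, not code from the repository: the local `FailureLabel` class is a stand-in for the real import so the block is runnable by itself, and `MANUAL_LABELS`, `label_counts`, and `needs_disclosure` are hypothetical names.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class FailureLabel(str, Enum):
    # Stand-in so the sketch runs on its own; in the repository,
    # use `from experiments.schema import FailureLabel` instead (step 1).
    NONE = "none"
    FORMAT_ERROR = "format_error"
    WRONG_SYNTHESIS = "wrong_synthesis"
    EVIDENCE_MISS = "evidence_miss"
    NULL_ANSWER = "null_answer"
    CONNECTION_ERROR = "connection_error"
    PARSE_ERROR = "parse_error"
    TIMEOUT = "timeout"
    OTHER = "other"


# Labels that currently have no automatic classifier (hypothetical constant).
MANUAL_LABELS = frozenset({FailureLabel.EVIDENCE_MISS})


def label_counts(labels: Iterable[FailureLabel]) -> Dict[str, int]:
    """Count labels, keyed by enum value so keys can serve as column headers (step 3)."""
    counts = Counter(labels)
    # Iterate the enum so every label appears, in definition order, even at zero.
    return {label.value: counts[label] for label in FailureLabel}


def needs_disclosure(labels: Iterable[FailureLabel]) -> bool:
    """True when any label was assigned manually and should be disclosed (step 4)."""
    return any(label in MANUAL_LABELS for label in labels)


# Step 2: refer to members, not to string literals such as "null_answer".
trial_labels = [FailureLabel.NULL_ANSWER, FailureLabel.NULL_ANSWER, FailureLabel.TIMEOUT]
summary = label_counts(trial_labels)
```

Keying the summary by `label.value` keeps report headers in lowercase snake_case without anyone typing the strings by hand.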

5. Future extensions (outside the scope of this reference)

  • Automatic classification of `EVIDENCE_MISS`: when Critic Tool / Evidence Tool is introduced (conceptFramework §5)
  • Sub-classification of `WRONG_SYNTHESIS` (calculation error vs reasoning error vs source error): separate plan
  • Classification of multilingual responses: separate plan

→ All of these are outside the scope of this plan (Stage 2B), following the "small B" strategy agreed with the user.

6. Change history

  • 2026-04-30 v1: Initial draft. Result of task-03 of the Stage 2B (scorer-failure-label-reference) plan. Standardizes the scattered ad-hoc labels.