[AGENT] Add reusable hallucination audit learnings #117
Greptile Overview
Greptile Summary
Extracts reusable hallucination audit tooling and documentation from audit work, including 6 Python analysis scripts, findings documentation, and a concrete mitigation plan. Scripts implement transcript classification, duplicate detection, audio volume analysis, full audio transcription, and Langfuse dataset sampling. Documentation captures key findings (379 hallucinated, 495 legit, 1178 unknown from 2052 traces) and recommends keeping current prompt-echo guards while adding tunable config keys and a vote-transcription path for suspicious snippets. All scripts follow clean architecture with proper error handling, retry logic for rate limits, and incremental output saves.
Confidence Score: 5/5
Important Files Changed
Last reviewed commit: bb013a2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb013a236a
| chosen_window = max(window_results.keys(), key=lambda k: len(window_results[k])) | ||
| traces = window_results[chosen_window] |
Honor --meeting-id across all fetched windows
When --date is omitted, the script fetches both today and yesterday but then picks only the window with the most traces before it applies --meeting-id. If the requested meeting exists in the other window, filtered becomes empty and the audit proceeds with zero records, producing a misleading summary and incorrect downstream artifacts. Resolve the meeting ID against all fetched windows (or choose the matching window first) before narrowing the trace set.
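One way to apply this suggestion is to filter by meeting ID before any window is chosen. The sketch below assumes `window_results` maps a window label to a list of trace dicts, and that each trace exposes its meeting ID under a hypothetical `meeting_id` key; the real field location is not visible in the diff.

```python
from typing import Any, Dict, List, Optional

Trace = Dict[str, Any]


def select_traces(
    window_results: Dict[str, List[Trace]],
    meeting_id: Optional[str] = None,
) -> List[Trace]:
    """Pick traces for the audit, honoring meeting_id across every window."""
    if meeting_id is not None:
        # Filter first, across all fetched windows, so a meeting that lives
        # in the smaller window is not discarded by the size heuristic.
        return [
            trace
            for traces in window_results.values()
            for trace in traces
            if trace.get("meeting_id") == meeting_id  # hypothetical field name
        ]
    if not window_results:
        return []
    # No meeting requested: keep the original "largest window" behavior.
    chosen_window = max(window_results, key=lambda k: len(window_results[k]))
    return window_results[chosen_window]
```

With this shape, an empty result for an explicit `--meeting-id` means the meeting is absent from every fetched window, which the caller could report as an error instead of auditing zero records.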
| if list(segment_dir.glob("segment_*.mp3")): | ||
| return sorted(segment_dir.glob("segment_*.mp3")) |
Regenerate segments when audio/chunk inputs change
The segment cache is reused whenever any segment_*.mp3 exists, without verifying that those files were produced from the current audio input or current --chunk-seconds value. Re-running after replacing audio_combined.mp3 or changing chunk size will silently transcribe stale segments, so full_transcript.txt can be out of sync with the intended meeting audio and corrupt alignment results.
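A possible fix is to store a small manifest next to the segments and reuse them only when it matches the current inputs. This is a sketch, not the script's actual code: the manifest file name and the helper names are hypothetical, and the caller is assumed to delete stale segments and re-split when `cached_segments` returns `None`.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional

MANIFEST_NAME = "segments_manifest.json"  # hypothetical file name


def file_digest(path: Path) -> str:
    """SHA-256 of the audio file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_segments(
    segment_dir: Path, audio_path: Path, chunk_seconds: int
) -> Optional[List[Path]]:
    """Return cached segments only if they were built from the current inputs."""
    manifest_path = segment_dir / MANIFEST_NAME
    segments = sorted(segment_dir.glob("segment_*.mp3"))
    if not segments or not manifest_path.exists():
        return None
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    expected = {"audio_sha256": file_digest(audio_path), "chunk_seconds": chunk_seconds}
    return segments if manifest == expected else None


def write_manifest(segment_dir: Path, audio_path: Path, chunk_seconds: int) -> None:
    """Record which inputs produced the segments currently on disk."""
    manifest = {"audio_sha256": file_digest(audio_path), "chunk_seconds": chunk_seconds}
    (segment_dir / MANIFEST_NAME).write_text(json.dumps(manifest), encoding="utf-8")
```

Hashing the whole file costs one extra read per run; comparing file size and modification time would be cheaper but less reliable.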
| if response.status_code in (400, 409): | ||
| return |
Raise on dataset-item 400 responses
Treating HTTP 400 as success causes malformed dataset-item requests to be silently dropped while the script continues as if sampling succeeded. If Langfuse rejects an item payload (schema change, invalid field, etc.), the resulting dataset can be smaller or imbalanced without any failure signal, which undermines audit/eval accuracy.
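As a rough illustration, the status handling could separate "already exists" from "rejected". The sketch assumes that a 409 from the dataset-items endpoint indicates a duplicate item that is safe to skip; that is an assumption, not something the diff confirms, and `check_dataset_item_response` is a hypothetical helper.

```python
SUCCESS_STATUSES = (200, 201)
# Assumption: 409 means the item already exists, so a re-run can skip it.
ALREADY_EXISTS_STATUS = 409


def check_dataset_item_response(status_code: int, body: str = "") -> bool:
    """Return True if created, False if it already existed; raise otherwise."""
    if status_code in SUCCESS_STATUSES:
        return True
    if status_code == ALREADY_EXISTS_STATUS:
        return False
    if status_code == 400:
        # A rejected payload is a data problem, not something to skip quietly.
        raise ValueError(f"dataset item rejected (400): {body[:200]}")
    raise RuntimeError(f"unexpected status {status_code}: {body[:200]}")
```

Returning a boolean also lets the caller count created versus skipped items and report a sample that came out smaller than requested.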
Pull request overview
This PR extracts reusable learnings and tooling from a hallucination audit (meeting ID 3837e4e0-64e9-44ba-b5de-c3a6849832d6, conducted on 2026-02-10) into the mainline repository without including large raw artifacts. The audit analyzed 2052 traces, finding 379 hallucinated outputs (primarily prompt echo), 495 legitimate outputs, and 1178 unknown. The PR provides documentation, a mitigation plan, and a set of Python scripts for future audit work.
Changes:
- Adds documentation of audit findings and mitigation recommendations
- Adds six reusable Python scripts for conducting hallucination audits
- Updates .gitignore for Python cache files and husky internals
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/hallucination-audit-20260210.md | Documents the 2026-02-10 hallucination audit findings including classification counts, SPS analysis, and audio volume metrics |
| docs/hallucination-mitigation-plan-20260213.md | Provides concrete mitigation recommendations including config changes, system improvements, and rollout strategy |
| analysis/hallucination-audit/README.md | Overview of the audit workspace and available scripts |
| analysis/hallucination-audit/run_audit.py | Main audit orchestration script that fetches Langfuse traces, classifies them, and builds duplicate groups |
| analysis/hallucination-audit/compute_audio_volume.py | Computes mean and max audio volume (dB) for Langfuse media using ffmpeg |
| analysis/hallucination-audit/download_full_audio.py | Downloads full meeting audio from S3 for analysis |
| analysis/hallucination-audit/transcribe_full_audio.py | Transcribes full meeting audio using OpenAI in segments |
| analysis/hallucination-audit/align_with_full_transcript.py | Aligns snippet transcripts with full transcript using fuzzy matching; uses csv.DictWriter for robust CSV output |
| analysis/hallucination-audit/create_langfuse_dataset_sample.py | Creates balanced Langfuse dataset samples for labeling with validation of class keys |
| .gitignore | Adds Python cache patterns and husky internal files |
| def build_index(words: List[str]) -> Dict[str, List[int]]: | ||
| index: Dict[str, List[int]] = {} | ||
| for pos, word in enumerate(words): | ||
| if len(word) < 4: | ||
| continue | ||
| index.setdefault(word, []).append(pos) | ||
| return index | ||
|
|
||
|
|
||
| def find_subsequence_window( | ||
| snippet_words: List[str], | ||
| full_words: List[str], | ||
| ) -> Optional[Tuple[int, int]]: | ||
| if not snippet_words or not full_words or len(snippet_words) > len(full_words): | ||
| return None | ||
| snippet_length = len(snippet_words) | ||
| last_start = len(full_words) - snippet_length | ||
| for start in range(last_start + 1): | ||
| if full_words[start : start + snippet_length] == snippet_words: | ||
| return (start, start + snippet_length) | ||
| return None | ||
|
|
||
|
|
||
| def best_match( | ||
| snippet_text: str, | ||
| snippet_words: List[str], | ||
| full_text: str, | ||
| full_words: List[str], | ||
| index: Dict[str, List[int]], | ||
| ) -> Tuple[Optional[float], str, Optional[Tuple[int, int]]]: | ||
| if not snippet_text: | ||
| return None, "empty", None | ||
| if snippet_text in full_text: | ||
| return 1.0, "substring", find_subsequence_window(snippet_words, full_words) | ||
|
|
||
| unique_words = sorted(set(snippet_words), key=len, reverse=True) | ||
| candidates = [word for word in unique_words if len(word) >= 4][:3] | ||
| if not candidates: | ||
| return None, "no_candidates", None | ||
|
|
||
| window_size = max(8, min(len(full_words), len(snippet_words) + 6)) | ||
| snippet_word_set = set(snippet_words) | ||
| best_score: Optional[float] = None | ||
| best_window: Optional[Tuple[int, int]] = None | ||
| for word in candidates: | ||
| positions = index.get(word, []) | ||
| if len(positions) > 100: | ||
| positions = positions[:100] | ||
| for pos in positions: | ||
| start = max(0, pos - 3) | ||
| end = min(len(full_words), start + window_size) | ||
| window_text = " ".join(full_words[start:end]) | ||
| if not window_text: | ||
| continue | ||
|
|
||
| if best_score is not None: | ||
| max_possible = 1 - ( | ||
| abs(len(snippet_text) - len(window_text)) | ||
| / max(len(snippet_text), len(window_text)) | ||
| ) | ||
| if max_possible <= best_score: | ||
| continue | ||
|
|
||
| window_word_set = set(full_words[start:end]) | ||
| if snippet_word_set and window_word_set: | ||
| overlap_ratio = len(snippet_word_set & window_word_set) / len( | ||
| snippet_word_set | ||
| ) | ||
| if overlap_ratio < 0.25: | ||
| continue | ||
|
|
||
| dist = levenshtein_distance(snippet_text, window_text) | ||
| ratio = dist / max(len(snippet_text), len(window_text)) | ||
| score = 1 - ratio | ||
| if best_score is None or score > best_score: | ||
| best_score = score | ||
| best_window = (start, end) | ||
| return best_score, "fuzzy", best_window |
This function contains several magic numbers that should be extracted as named constants for better code clarity:
- 4 (lines 40, 73): minimum word length for indexing and candidate selection
- 3 (line 73): maximum number of candidate words
- 8 (line 77): minimum window size
- 6 (line 77): window size padding
- 100 (line 83): maximum positions to check per word
- 3 (line 86): position offset for window start
- 0.25 (line 105): minimum overlap ratio threshold
These magic numbers represent important thresholds for the fuzzy matching algorithm. Extracting them as named constants would make the algorithm's behavior more transparent and easier to tune.
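A minimal sketch of the extraction, using the literal values from the snippet above. The constant names and the two small helpers are hypothetical; the script may prefer different names or keep the logic inline in `best_match`.

```python
from typing import List

# Hypothetical names; the values are the literals currently inlined in best_match.
MIN_INDEXED_WORD_LENGTH = 4
MAX_CANDIDATE_WORDS = 3
MIN_WINDOW_WORDS = 8
WINDOW_PADDING_WORDS = 6
MAX_POSITIONS_PER_WORD = 100
WINDOW_START_OFFSET = 3
MIN_OVERLAP_RATIO = 0.25


def pick_candidates(snippet_words: List[str]) -> List[str]:
    """Longest distinct words first, limited to the candidate budget."""
    unique_words = sorted(set(snippet_words), key=len, reverse=True)
    long_enough = [w for w in unique_words if len(w) >= MIN_INDEXED_WORD_LENGTH]
    return long_enough[:MAX_CANDIDATE_WORDS]


def window_size(full_word_count: int, snippet_word_count: int) -> int:
    """Window length in words, padded around the snippet length."""
    padded = snippet_word_count + WINDOW_PADDING_WORDS
    return max(MIN_WINDOW_WORDS, min(full_word_count, padded))
```

Naming the thresholds also makes it easier to expose them as command-line options later if tuning becomes necessary.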
| and avg_logprob > -1.2 | ||
| and min_logprob > -2.5 | ||
| ) |
The logprob thresholds -1.2 and -2.5 used to classify records as "legit" are magic numbers that should be extracted as named constants. These thresholds are critical for the classification algorithm and may need to be tuned based on further analysis.
Consider extracting these as module-level constants like CLEAN_AVG_LOGPROB_THRESHOLD = -1.2 and CLEAN_MIN_LOGPROB_THRESHOLD = -2.5 to make them more discoverable and easier to adjust.
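For illustration, the two constants could be paired with a small predicate. Only the logprob part of the "legit" condition is visible in the diff, so `passes_logprob_gate` is a hypothetical helper covering that part alone.

```python
CLEAN_AVG_LOGPROB_THRESHOLD = -1.2
CLEAN_MIN_LOGPROB_THRESHOLD = -2.5


def passes_logprob_gate(avg_logprob: float, min_logprob: float) -> bool:
    """Logprob half of the "legit" check; both comparisons are strict."""
    return (
        avg_logprob > CLEAN_AVG_LOGPROB_THRESHOLD
        and min_logprob > CLEAN_MIN_LOGPROB_THRESHOLD
    )
```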
| data={ | ||
| "model": model, | ||
| "response_format": "json", | ||
| "temperature": "0", |
The temperature parameter is passed as a string "0" instead of a numeric value. While the OpenAI API may accept this, it's better practice to pass numeric parameters as their proper type (integer or float) for API calls. This should be "temperature": 0 or "temperature": 0.0.
| "temperature": "0", | |
| "temperature": 0, |
| def load_env(path: Path) -> None: | ||
| if not path.exists(): | ||
| return | ||
| for line in path.read_text(encoding="utf-8").splitlines(): | ||
| stripped = line.strip() | ||
| if not stripped or stripped.startswith("#") or "=" not in stripped: | ||
| continue | ||
| key, value = stripped.split("=", 1) | ||
| key = key.strip() | ||
| value = value.strip().strip('"').strip("'") | ||
| if key and key not in os.environ: | ||
| os.environ[key] = value |
The load_env function is duplicated across all six Python scripts. This violates the DRY (Don't Repeat Yourself) principle and makes maintenance harder. If a bug is found or improvement is needed in this function, it would need to be fixed in six places.
Consider extracting this function into a shared utility module (for example, analysis/hallucination-audit/utils.py or analysis/hallucination-audit/common.py) that all scripts can import from. This would ensure consistent behavior across all scripts and make future updates easier.
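A sketch of what the shared module might look like, assuming the file is named `common.py` (one of the two names suggested above) and that the function body stays exactly as in the scripts.

```python
"""Hypothetical analysis/hallucination-audit/common.py shared by the audit scripts."""
import os
from pathlib import Path


def load_env(path: Path) -> None:
    """Load KEY=VALUE lines from a .env file without overriding existing variables."""
    if not path.exists():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and key not in os.environ:
            os.environ[key] = value
```

Each script would then use `from common import load_env`. Because the directory name contains a hyphen it cannot be imported as a package, but a sibling-module import works when the scripts are run directly, since Python puts the script's own directory on `sys.path`.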
| def build_near_duplicate_groups(records: List[Dict[str, Any]]) -> Dict[str, int]: | ||
| candidates: List[Tuple[str, str]] = [] | ||
| for record in records: | ||
| norm = record.get("normalized_text") | ||
| if not norm or len(norm) < 12: | ||
| continue | ||
| candidates.append((record["trace_id"], norm)) | ||
| if not candidates: | ||
| return {} | ||
|
|
||
| buckets: Dict[Tuple[int, str], List[Tuple[str, str]]] = defaultdict(list) | ||
| for trace_id, norm in candidates: | ||
| length_bucket = len(norm) // 20 | ||
| prefix = norm[:5] | ||
| buckets[(length_bucket, prefix)].append((trace_id, norm)) | ||
|
|
||
| parent: Dict[str, str] = {} | ||
|
|
||
| def find(x: str) -> str: | ||
| root = parent.get(x, x) | ||
| if root != x: | ||
| parent[x] = find(root) | ||
| return parent.get(x, x) | ||
|
|
||
| def union(a: str, b: str) -> None: | ||
| ra = find(a) | ||
| rb = find(b) | ||
| if ra != rb: | ||
| parent[rb] = ra | ||
|
|
||
| for items in buckets.values(): | ||
| if len(items) < 2: | ||
| continue | ||
| if len(items) > 200: | ||
| continue | ||
| for i in range(len(items)): | ||
| trace_a, text_a = items[i] | ||
| for j in range(i + 1, len(items)): | ||
| trace_b, text_b = items[j] | ||
| if abs(len(text_a) - len(text_b)) > 20: | ||
| continue | ||
| dist = levenshtein_distance(text_a, text_b) | ||
| ratio = dist / max(len(text_a), len(text_b)) | ||
| if ratio <= 0.2: | ||
| union(trace_a, trace_b) | ||
|
|
||
| groups: Dict[str, int] = {} | ||
| group_id = 1 | ||
| clusters: Dict[str, List[str]] = defaultdict(list) | ||
| for trace_id, _ in candidates: | ||
| root = find(trace_id) | ||
| clusters[root].append(trace_id) | ||
| for trace_ids in clusters.values(): | ||
| if len(trace_ids) < 2: | ||
| continue | ||
| for trace_id in trace_ids: | ||
| groups[trace_id] = group_id | ||
| group_id += 1 | ||
| return groups |
This function contains several magic numbers that should be extracted as named constants to improve code clarity:
- 12 (line 126): minimum normalized text length for near-duplicate detection
- 20 (line 134): length bucket divisor for grouping similar-length texts
- 5 (line 135): prefix length for bucketing
- 200 (line 155): maximum bucket size before skipping
- 20 (line 161): maximum length difference threshold
- 0.2 (line 165): maximum distance ratio for near-duplicates
These magic numbers represent important thresholds for the near-duplicate detection algorithm. Extracting them as named constants would make the code more maintainable and the algorithm's behavior more transparent.
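The same extraction could look roughly like this for the near-duplicate pass. The constant and helper names are hypothetical, and the Levenshtein distance is assumed to be computed by the caller, as in the existing loop.

```python
from typing import Tuple

# Hypothetical names; values are the literals from build_near_duplicate_groups.
MIN_NORMALIZED_LENGTH = 12
LENGTH_BUCKET_WIDTH = 20
BUCKET_PREFIX_LENGTH = 5
MAX_BUCKET_SIZE = 200
MAX_LENGTH_DIFFERENCE = 20
MAX_DISTANCE_RATIO = 0.2


def bucket_key(norm: str) -> Tuple[int, str]:
    """Group texts by coarse length and shared prefix before pairwise comparison."""
    return (len(norm) // LENGTH_BUCKET_WIDTH, norm[:BUCKET_PREFIX_LENGTH])


def is_near_duplicate(text_a: str, text_b: str, distance: int) -> bool:
    """Apply the length and ratio thresholds; both texts are assumed non-empty."""
    if abs(len(text_a) - len(text_b)) > MAX_LENGTH_DIFFERENCE:
        return False
    return distance / max(len(text_a), len(text_b)) <= MAX_DISTANCE_RATIO
```

Texts shorter than `MIN_NORMALIZED_LENGTH` are filtered out before bucketing in the original function, so the division never sees empty strings there.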
| response = requests.get( | ||
| f"{base_url.rstrip('/')}/api/public/media/{media_id}", | ||
| auth=(public_key, secret_key), | ||
| timeout=60, | ||
| ) | ||
| if response.status_code == 429: | ||
| retry_after = response.headers.get("Retry-After") | ||
| wait_seconds = float(retry_after) if retry_after else delay | ||
| time.sleep(wait_seconds) | ||
| delay = min(delay * 2, 30) | ||
| continue | ||
| response.raise_for_status() | ||
| payload = response.json() | ||
| return payload["url"] |
The fetch_media_url function can raise RuntimeError after exhausting retries, but it doesn't handle other HTTP errors (non-429 status codes) that might occur during the retry loop. If a 500 error or network error occurs, response.raise_for_status() will raise an exception that terminates the retry loop immediately, even though retrying might succeed.
Consider wrapping response.raise_for_status() in a try-except block that catches requests.HTTPError and requests.RequestException, retrying on transient errors while only raising on final failure.
| response = requests.get( | |
| f"{base_url.rstrip('/')}/api/public/media/{media_id}", | |
| auth=(public_key, secret_key), | |
| timeout=60, | |
| ) | |
| if response.status_code == 429: | |
| retry_after = response.headers.get("Retry-After") | |
| wait_seconds = float(retry_after) if retry_after else delay | |
| time.sleep(wait_seconds) | |
| delay = min(delay * 2, 30) | |
| continue | |
| response.raise_for_status() | |
| payload = response.json() | |
| return payload["url"] | |
| try: | |
| response = requests.get( | |
| f"{base_url.rstrip('/')}/api/public/media/{media_id}", | |
| auth=(public_key, secret_key), | |
| timeout=60, | |
| ) | |
| if response.status_code == 429: | |
| retry_after = response.headers.get("Retry-After") | |
| wait_seconds = float(retry_after) if retry_after else delay | |
| time.sleep(wait_seconds) | |
| delay = min(delay * 2, 30) | |
| continue | |
| response.raise_for_status() | |
| payload = response.json() | |
| return payload["url"] | |
| except (requests.HTTPError, requests.RequestException): | |
| if attempt == retries - 1: | |
| raise | |
| time.sleep(delay) | |
| delay = min(delay * 2, 30) | |
| continue |
| def create_dataset_item( | ||
| base_url: str, | ||
| auth: Tuple[str, str], | ||
| payload: Dict[str, Any], | ||
| retries: int = 5, | ||
| ) -> None: | ||
| delay = 0.5 | ||
| for attempt in range(retries): | ||
| response = requests.post( | ||
| f"{base_url.rstrip('/')}/api/public/dataset-items", | ||
| auth=auth, | ||
| json=payload, | ||
| timeout=30, | ||
| ) | ||
| if response.status_code == 429: | ||
| retry_after = response.headers.get("Retry-After") | ||
| wait_seconds = float(retry_after) if retry_after else delay | ||
| time.sleep(wait_seconds) | ||
| delay = min(delay * 2, 10) | ||
| continue | ||
| if response.status_code in (200, 201): | ||
| return | ||
| if response.status_code in (400, 409): | ||
| return | ||
| response.raise_for_status() | ||
| raise RuntimeError("rate_limited dataset-items") |
Similar to fetch_media_url in compute_audio_volume.py, the create_dataset_item function can raise exceptions from response.raise_for_status() that terminate the retry loop prematurely. If a transient 5xx error or network error occurs, the function raises immediately instead of retrying.
Consider wrapping response.raise_for_status() in a try-except block to handle transient errors gracefully and only fail after exhausting all retries.
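One possible shape for this is a small retry helper that both scripts could share. The sketch below is deliberately independent of `requests`: `send` is a hypothetical callable that performs one HTTP attempt and returns the status code plus any `Retry-After` value, the set of transient statuses is an assumption, and the built-in `ConnectionError` stands in for `requests.RequestException`.

```python
import time
from typing import Callable, Optional, Tuple

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}  # assumption: statuses worth retrying

# send() performs one HTTP attempt and returns (status_code, retry_after_seconds).
SendFn = Callable[[], Tuple[int, Optional[float]]]


def call_with_retries(
    send: SendFn,
    retries: int = 5,
    initial_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry transient failures with capped exponential backoff."""
    delay = initial_delay
    last_error = "no attempts made"
    for attempt in range(retries):
        retry_after: Optional[float] = None
        try:
            status, retry_after = send()
        except ConnectionError as exc:  # stand-in for requests.RequestException
            last_error = repr(exc)
        else:
            if status not in TRANSIENT_STATUSES:
                return status  # caller decides what 2xx and 4xx mean
            last_error = f"HTTP {status}"
        if attempt < retries - 1:
            # Prefer the server's Retry-After hint over the local backoff delay.
            sleep(retry_after if retry_after is not None else delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"gave up after {retries} attempts: {last_error}")
```

Non-transient statuses such as 400 are returned immediately instead of retried, so a malformed payload fails fast while a brief 5xx outage does not abort the run.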
| if len(a) < len(b): | ||
| a, b = b, a | ||
| previous = list(range(len(b) + 1)) | ||
| for i, ca in enumerate(a, start=1): | ||
| current = [i] | ||
| for j, cb in enumerate(b, start=1): | ||
| insert_cost = current[j - 1] + 1 | ||
| delete_cost = previous[j] + 1 | ||
| replace_cost = previous[j - 1] + (0 if ca == cb else 1) | ||
| current.append(min(insert_cost, delete_cost, replace_cost)) | ||
| previous = current | ||
| return previous[-1] |
The normalize_text and levenshtein_distance functions are duplicated in both run_audit.py and align_with_full_transcript.py. This code duplication violates the DRY principle and creates maintenance burden.
Consider extracting these shared functions into a common utility module that both scripts can import from. This ensures consistent text normalization and distance calculation logic across different parts of the audit tooling.
| if len(a) < len(b): | |
| a, b = b, a | |
| previous = list(range(len(b) + 1)) | |
| for i, ca in enumerate(a, start=1): | |
| current = [i] | |
| for j, cb in enumerate(b, start=1): | |
| insert_cost = current[j - 1] + 1 | |
| delete_cost = previous[j] + 1 | |
| replace_cost = previous[j - 1] + (0 if ca == cb else 1) | |
| current.append(min(insert_cost, delete_cost, replace_cost)) | |
| previous = current | |
| return previous[-1] | |
| len_a = len(a) | |
| len_b = len(b) | |
| dp = [[0] * (len_b + 1) for _ in range(len_a + 1)] | |
| for i in range(len_a + 1): | |
| dp[i][0] = i | |
| for j in range(len_b + 1): | |
| dp[0][j] = j | |
| for i in range(1, len_a + 1): | |
| ca = a[i - 1] | |
| for j in range(1, len_b + 1): | |
| cb = b[j - 1] | |
| cost = 0 if ca == cb else 1 | |
| dp[i][j] = min( | |
| dp[i - 1][j] + 1, | |
| dp[i][j - 1] + 1, | |
| dp[i - 1][j - 1] + cost, | |
| ) | |
| return dp[len_a][len_b] |
| output_csv = meeting_dir / "transcriptions_classified_with_audio.csv" | ||
| fields = list(records[0].keys()) if records else [] | ||
| if records: | ||
| with output_csv.open("w", encoding="utf-8", newline="") as handle: | ||
| handle.write(",".join(fields) + "\n") | ||
| for record in records: | ||
| row = [] | ||
| for field in fields: | ||
| value = record.get(field) | ||
| if isinstance(value, list): | ||
| value = "|".join(str(item) for item in value) | ||
| elif isinstance(value, dict): | ||
| value = json.dumps(value) | ||
| elif value is None: | ||
| value = "" | ||
| text = str(value) | ||
| if "," in text or "\n" in text or '"' in text: | ||
| text = '"' + text.replace('"', '""') + '"' | ||
| row.append(text) | ||
| handle.write(",".join(row) + "\n") |
This script uses manual CSV writing (lines 242-257) instead of the more robust csv.DictWriter approach that is properly used in align_with_full_transcript.py. The PR description mentions "uses csv.DictWriter with full field union for robust CSV output" as a hardening improvement, but this script wasn't updated.
Manual CSV writing is more error-prone and harder to maintain. The implementation in align_with_full_transcript.py (lines 190-197) demonstrates the better approach using csv.DictWriter with proper field handling. Consider applying the same pattern here for consistency and robustness.
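A minimal sketch of the `csv.DictWriter` version, assuming the list, dict, and `None` handling should stay the same as in the manual writer. `flatten_value` and `write_records_csv` are hypothetical helper names, not functions from the PR.

```python
import csv
import json
from pathlib import Path
from typing import Any, Dict, List


def flatten_value(value: Any) -> str:
    """Mirror the manual writer: lists joined with '|', dicts as JSON, None as ''."""
    if isinstance(value, list):
        return "|".join(str(item) for item in value)
    if isinstance(value, dict):
        return json.dumps(value)
    return "" if value is None else str(value)


def write_records_csv(output_csv: Path, records: List[Dict[str, Any]]) -> None:
    """Write records with csv.DictWriter, which handles quoting and escaping."""
    if not records:
        return
    # Union of keys in first-seen order, so fields missing from records[0] are kept.
    fields: List[str] = list(dict.fromkeys(key for record in records for key in record))
    with output_csv.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, restval="")
        writer.writeheader()
        for record in records:
            writer.writerow({key: flatten_value(value) for key, value in record.items()})
```

Taking the field union also avoids a quiet failure mode of the manual version, which derives columns from `records[0]` only and drops any key that first appears in a later record.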
Visual regression report
No visual diffs detected. Run: https://github.com/Chronote-gg/Chronote/actions/runs/21976613954
[AGENT] Extracts permanent, low-footprint learnings from the hallucination audit into reusable tooling and docs, without carrying large raw artifacts into mainline history.
What this PR includes
- analysis/hallucination-audit/.
- docs/hallucination-audit-20260210.md.
- docs/hallucination-mitigation-plan-20260213.md.
- .gitignore for Python cache and husky internal files.
Script hardening included
- align_with_full_transcript.py: csv.DictWriter with full field union for robust CSV output
- create_langfuse_dataset_sample.py: --sample-size, --counts
Intentionally excluded
Context