
[AGENT] Add reusable hallucination audit learnings#117

Draft
BASIC-BIT wants to merge 1 commit into master from analysis-hallucination-learnings-20260213


@BASIC-BIT
Collaborator

[AGENT] Extracts permanent, low-footprint learnings from the hallucination audit into reusable tooling and docs, without carrying large raw artifacts into mainline history.

What this PR includes

  • Adds reusable audit scripts under analysis/hallucination-audit/.
  • Adds the audit findings summary doc docs/hallucination-audit-20260210.md.
  • Adds a concrete mitigation plan doc docs/hallucination-mitigation-plan-20260213.md.
  • Updates .gitignore for Python cache and husky internal files.

Script hardening included

  • align_with_full_transcript.py
    • returns word-window for exact substring matches
    • uses csv.DictWriter with full field union for robust CSV output
    • adds lightweight pruning before expensive similarity checks
  • create_langfuse_dataset_sample.py
    • default class counts now respect --sample-size
    • validates unknown class keys in --counts

Intentionally excluded

  • No large meeting artifacts, raw trace dumps, or audio blobs.
  • No production runtime behavior changes in bot paths.

Context

@greptile-apps

greptile-apps Bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

Extracts reusable hallucination audit tooling and documentation from audit work, including 6 Python analysis scripts, findings documentation, and a concrete mitigation plan. Scripts implement transcript classification, duplicate detection, audio volume analysis, full audio transcription, and Langfuse dataset sampling. Documentation captures key findings (379 hallucinated, 495 legit, 1178 unknown from 2052 traces) and recommends keeping current prompt-echo guards while adding tunable config keys and a vote-transcription path for suspicious snippets. All scripts follow clean architecture with proper error handling, retry logic for rate limits, and incremental output saves.

  • Adds complete audit toolkit under analysis/hallucination-audit/ with main orchestration script and supporting utilities
  • Documents audit findings with classification breakdowns, syllable-rate analysis, and audio loudness thresholds
  • Provides actionable mitigation plan with phased rollout strategy (config keys, vote transcription, threshold tuning)
  • Updates .gitignore for Python artifacts (__pycache__/, *.pyc) and husky internals (.husky/_)
  • All Python scripts properly handle env loading, API retries, and edge cases (rate limits, missing files, silence detection)

Confidence Score: 5/5

  • Safe to merge - well-architected analysis tooling with no production runtime changes
  • All changes are isolated to analysis scripts and documentation with no impact on bot runtime behavior. Scripts demonstrate good practices (retry logic, error handling, incremental saves). Documentation is clear and actionable. The PR explicitly excludes large artifacts and production changes.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| analysis/hallucination-audit/align_with_full_transcript.py | Implements exact and fuzzy transcript matching with word-window alignment |
| analysis/hallucination-audit/compute_audio_volume.py | Fetches Langfuse audio media and computes volume metrics via ffmpeg |
| analysis/hallucination-audit/create_langfuse_dataset_sample.py | Creates balanced Langfuse dataset samples for manual hallucination labeling |
| analysis/hallucination-audit/run_audit.py | Main audit script that classifies transcriptions and detects duplicates |
| docs/hallucination-mitigation-plan-20260213.md | Outlines mitigation strategy with config recommendations and rollout plan |

Last reviewed commit: bb013a2

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb013a236a


Comment on lines +361 to +362
chosen_window = max(window_results.keys(), key=lambda k: len(window_results[k]))
traces = window_results[chosen_window]

[P1] Honor --meeting-id across all fetched windows

When --date is omitted, the script fetches both today and yesterday but then picks only the window with the most traces before it applies --meeting-id. If the requested meeting exists in the other window, filtered becomes empty and the audit proceeds with zero records, producing a misleading summary and incorrect downstream artifacts. Resolve the meeting ID against all fetched windows (or choose the matching window first) before narrowing the trace set.
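
A minimal sketch of the suggested fix, under stated assumptions: it searches every fetched window for the requested meeting and only falls back to the largest window when no filter is given. The helper name `select_traces` and the `meeting_id` key on each trace are hypothetical; the real trace schema in run_audit.py is not shown in this thread.

```python
from typing import Any, Dict, List, Optional


def select_traces(
    window_results: Dict[str, List[Dict[str, Any]]],
    meeting_id: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Pick the traces to audit from all fetched windows.

    Assumes (hypothetically) that each trace dict carries a "meeting_id" key.
    """
    if not window_results:
        return []
    if meeting_id is None:
        # No filter requested: keep the existing behavior (largest window wins).
        chosen = max(window_results, key=lambda k: len(window_results[k]))
        return window_results[chosen]
    # Filter requested: look in every window so a match in the smaller
    # window is not discarded before the filter runs.
    matched = [
        trace
        for traces in window_results.values()
        for trace in traces
        if trace.get("meeting_id") == meeting_id
    ]
    if not matched:
        # Fail loudly instead of auditing zero records.
        raise ValueError(f"no traces found for meeting {meeting_id!r}")
    return matched
```

Raising when nothing matches is one possible design choice; it avoids producing a summary built from an empty trace set.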


Comment on lines +32 to +33
if list(segment_dir.glob("segment_*.mp3")):
    return sorted(segment_dir.glob("segment_*.mp3"))

[P2] Regenerate segments when audio/chunk inputs change

The segment cache is reused whenever any segment_*.mp3 exists, without verifying that those files were produced from the current audio input or current --chunk-seconds value. Re-running after replacing audio_combined.mp3 or changing chunk size will silently transcribe stale segments, so full_transcript.txt can be out of sync with the intended meeting audio and corrupt alignment results.
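
One way to detect a stale cache is to store a small manifest next to the segments and compare it on each run. The sketch below is an assumption-laden illustration: the manifest file name `segments_manifest.json` and all three helper names are hypothetical, and the fingerprint uses file size and modification time as a cheap proxy rather than hashing the audio content.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional

MANIFEST_NAME = "segments_manifest.json"  # hypothetical file name


def cache_fingerprint(audio_path: Path, chunk_seconds: int) -> str:
    """Hash the inputs that determine what the segments contain."""
    stat = audio_path.stat()
    key = f"{audio_path.name}:{stat.st_size}:{stat.st_mtime_ns}:{chunk_seconds}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def cached_segments(
    segment_dir: Path, audio_path: Path, chunk_seconds: int
) -> Optional[List[Path]]:
    """Return cached segments only if they were built from the current inputs."""
    manifest = segment_dir / MANIFEST_NAME
    segments = sorted(segment_dir.glob("segment_*.mp3"))
    if not segments or not manifest.exists():
        return None
    saved = json.loads(manifest.read_text(encoding="utf-8"))
    if saved.get("fingerprint") != cache_fingerprint(audio_path, chunk_seconds):
        return None  # stale: the caller should delete and regenerate
    return segments


def write_manifest(segment_dir: Path, audio_path: Path, chunk_seconds: int) -> None:
    """Record the inputs right after segments are (re)generated."""
    payload = {"fingerprint": cache_fingerprint(audio_path, chunk_seconds)}
    (segment_dir / MANIFEST_NAME).write_text(json.dumps(payload), encoding="utf-8")
```

A caller would regenerate segments whenever `cached_segments` returns `None` and then call `write_manifest`.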


Comment on lines +130 to +131
if response.status_code in (400, 409):
    return

[P2] Raise on dataset-item 400 responses

Treating HTTP 400 as success causes malformed dataset-item requests to be silently dropped while the script continues as if sampling succeeded. If Langfuse rejects an item payload (schema change, invalid field, etc.), the resulting dataset can be smaller or imbalanced without any failure signal, which undermines audit/eval accuracy.
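
The distinction the comment asks for could be expressed as a small status classifier, sketched below. The function name `classify_item_response` is hypothetical, and treating 409 as "item already exists" is an assumption about the API's semantics that should be checked against the Langfuse documentation.

```python
def classify_item_response(status_code: int) -> str:
    """Map a dataset-item HTTP status to an action.

    Returns "ok", "duplicate", or "retry"; raises on anything else.
    """
    if status_code in (200, 201):
        return "ok"
    if status_code == 409:
        # Assumed meaning: the item already exists, so skipping is safe.
        return "duplicate"
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"
    # 400 and other client errors mean the payload was rejected: fail loudly
    # so a smaller or imbalanced dataset is never produced silently.
    raise ValueError(f"dataset-item request rejected with HTTP {status_code}")
```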


Copilot AI left a comment

Pull request overview

This PR extracts reusable learnings and tooling from a hallucination audit (meeting ID 3837e4e0-64e9-44ba-b5de-c3a6849832d6, conducted on 2026-02-10) into the mainline repository without including large raw artifacts. The audit analyzed 2052 traces, finding 379 hallucinated outputs (primarily prompt echo), 495 legitimate outputs, and 1178 unknown. The PR provides documentation, a mitigation plan, and a set of Python scripts for future audit work.

Changes:

  • Adds documentation of audit findings and mitigation recommendations
  • Adds six reusable Python scripts for conducting hallucination audits
  • Updates .gitignore for Python cache files and husky internals

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| docs/hallucination-audit-20260210.md | Documents the 2026-02-10 hallucination audit findings including classification counts, SPS analysis, and audio volume metrics |
| docs/hallucination-mitigation-plan-20260213.md | Provides concrete mitigation recommendations including config changes, system improvements, and rollout strategy |
| analysis/hallucination-audit/README.md | Overview of the audit workspace and available scripts |
| analysis/hallucination-audit/run_audit.py | Main audit orchestration script that fetches Langfuse traces, classifies them, and builds duplicate groups |
| analysis/hallucination-audit/compute_audio_volume.py | Computes mean and max audio volume (dB) for Langfuse media using ffmpeg |
| analysis/hallucination-audit/download_full_audio.py | Downloads full meeting audio from S3 for analysis |
| analysis/hallucination-audit/transcribe_full_audio.py | Transcribes full meeting audio using OpenAI in segments |
| analysis/hallucination-audit/align_with_full_transcript.py | Aligns snippet transcripts with full transcript using fuzzy matching; uses csv.DictWriter for robust CSV output |
| analysis/hallucination-audit/create_langfuse_dataset_sample.py | Creates balanced Langfuse dataset samples for labeling with validation of class keys |
| .gitignore | Adds Python cache patterns and husky internal files |

Comment on lines +37 to +114
def build_index(words: List[str]) -> Dict[str, List[int]]:
    index: Dict[str, List[int]] = {}
    for pos, word in enumerate(words):
        if len(word) < 4:
            continue
        index.setdefault(word, []).append(pos)
    return index


def find_subsequence_window(
    snippet_words: List[str],
    full_words: List[str],
) -> Optional[Tuple[int, int]]:
    if not snippet_words or not full_words or len(snippet_words) > len(full_words):
        return None
    snippet_length = len(snippet_words)
    last_start = len(full_words) - snippet_length
    for start in range(last_start + 1):
        if full_words[start : start + snippet_length] == snippet_words:
            return (start, start + snippet_length)
    return None


def best_match(
    snippet_text: str,
    snippet_words: List[str],
    full_text: str,
    full_words: List[str],
    index: Dict[str, List[int]],
) -> Tuple[Optional[float], str, Optional[Tuple[int, int]]]:
    if not snippet_text:
        return None, "empty", None
    if snippet_text in full_text:
        return 1.0, "substring", find_subsequence_window(snippet_words, full_words)

    unique_words = sorted(set(snippet_words), key=len, reverse=True)
    candidates = [word for word in unique_words if len(word) >= 4][:3]
    if not candidates:
        return None, "no_candidates", None

    window_size = max(8, min(len(full_words), len(snippet_words) + 6))
    snippet_word_set = set(snippet_words)
    best_score: Optional[float] = None
    best_window: Optional[Tuple[int, int]] = None
    for word in candidates:
        positions = index.get(word, [])
        if len(positions) > 100:
            positions = positions[:100]
        for pos in positions:
            start = max(0, pos - 3)
            end = min(len(full_words), start + window_size)
            window_text = " ".join(full_words[start:end])
            if not window_text:
                continue

            if best_score is not None:
                max_possible = 1 - (
                    abs(len(snippet_text) - len(window_text))
                    / max(len(snippet_text), len(window_text))
                )
                if max_possible <= best_score:
                    continue

            window_word_set = set(full_words[start:end])
            if snippet_word_set and window_word_set:
                overlap_ratio = len(snippet_word_set & window_word_set) / len(
                    snippet_word_set
                )
                if overlap_ratio < 0.25:
                    continue

            dist = levenshtein_distance(snippet_text, window_text)
            ratio = dist / max(len(snippet_text), len(window_text))
            score = 1 - ratio
            if best_score is None or score > best_score:
                best_score = score
                best_window = (start, end)
    return best_score, "fuzzy", best_window
Copilot AI Feb 13, 2026

This function contains several magic numbers that should be extracted as named constants for better code clarity:

  • 4 (lines 40, 73): minimum word length for indexing and candidate selection
  • 3 (line 73): maximum number of candidate words
  • 8 (line 77): minimum window size
  • 6 (line 77): window size padding
  • 100 (line 83): maximum positions to check per word
  • 3 (line 86): position offset for window start
  • 0.25 (line 105): minimum overlap ratio threshold

These magic numbers represent important thresholds for the fuzzy matching algorithm. Extracting them as named constants would make the algorithm's behavior more transparent and easier to tune.
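
As an illustration of this suggestion, the thresholds could be lifted to module level roughly as follows. The constant and helper names are hypothetical; the values are the ones listed in the review comment.

```python
from typing import List

# Hypothetical names; values taken from the literals in best_match/build_index.
MIN_INDEXED_WORD_LENGTH = 4    # shorter words are neither indexed nor used as anchors
MAX_CANDIDATE_WORDS = 3        # number of longest distinct words tried as anchors
MIN_WINDOW_WORDS = 8           # smallest comparison window, in words
WINDOW_PADDING_WORDS = 6       # extra words added to the snippet length
MAX_POSITIONS_PER_WORD = 100   # cap on index hits examined per anchor word
WINDOW_START_OFFSET = 3        # words of left context before the anchor
MIN_WORD_OVERLAP_RATIO = 0.25  # prune windows sharing too few snippet words


def pick_candidates(snippet_words: List[str]) -> List[str]:
    """Select the longest distinct words to use as anchors."""
    unique_words = sorted(set(snippet_words), key=len, reverse=True)
    long_words = [w for w in unique_words if len(w) >= MIN_INDEXED_WORD_LENGTH]
    return long_words[:MAX_CANDIDATE_WORDS]


def window_size_for(snippet_len: int, full_len: int) -> int:
    """Compute the comparison window size in words."""
    return max(MIN_WINDOW_WORDS, min(full_len, snippet_len + WINDOW_PADDING_WORDS))
```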

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +271
and avg_logprob > -1.2
and min_logprob > -2.5
)
Copilot AI Feb 13, 2026

The logprob thresholds -1.2 and -2.5 used to classify records as "legit" are magic numbers that should be extracted as named constants. These thresholds are critical for the classification algorithm and may need to be tuned based on further analysis.

Consider extracting these as module-level constants like CLEAN_AVG_LOGPROB_THRESHOLD = -1.2 and CLEAN_MIN_LOGPROB_THRESHOLD = -2.5 to make them more discoverable and easier to adjust.

data={
    "model": model,
    "response_format": "json",
    "temperature": "0",
Copilot AI Feb 13, 2026

The temperature parameter is passed as a string "0" instead of a numeric value. While the OpenAI API may accept this, it's better practice to pass numeric parameters as their proper type (integer or float) for API calls. This should be "temperature": 0 or "temperature": 0.0.

Suggested change
- "temperature": "0",
+ "temperature": 0,

Comment on lines +23 to +34
def load_env(path: Path) -> None:
    if not path.exists():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and key not in os.environ:
            os.environ[key] = value
Copilot AI Feb 13, 2026

The load_env function is duplicated across all six Python scripts. This violates the DRY (Don't Repeat Yourself) principle and makes maintenance harder. If a bug is found or improvement is needed in this function, it would need to be fixed in six places.

Consider extracting this function into a shared utility module (for example, analysis/hallucination-audit/utils.py or analysis/hallucination-audit/common.py) that all scripts can import from. This would ensure consistent behavior across all scripts and make future updates easier.
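
A sketch of what such a shared module might look like, assuming it is saved as `common.py` (one of the names suggested above) next to the scripts. The function body is the one quoted in this comment; only its location changes.

```python
# common.py (hypothetical shared module inside analysis/hallucination-audit/)
import os
from pathlib import Path


def load_env(path: Path) -> None:
    """Load KEY=VALUE pairs from a .env-style file.

    Variables already present in the environment are never overridden.
    """
    if not path.exists():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and key not in os.environ:
            os.environ[key] = value
```

Because Python puts the directory of the executed script first on `sys.path`, each script run directly from that folder could then use `from common import load_env` without any packaging changes.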

Comment on lines +122 to +180
def build_near_duplicate_groups(records: List[Dict[str, Any]]) -> Dict[str, int]:
    candidates: List[Tuple[str, str]] = []
    for record in records:
        norm = record.get("normalized_text")
        if not norm or len(norm) < 12:
            continue
        candidates.append((record["trace_id"], norm))
    if not candidates:
        return {}

    buckets: Dict[Tuple[int, str], List[Tuple[str, str]]] = defaultdict(list)
    for trace_id, norm in candidates:
        length_bucket = len(norm) // 20
        prefix = norm[:5]
        buckets[(length_bucket, prefix)].append((trace_id, norm))

    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        root = parent.get(x, x)
        if root != x:
            parent[x] = find(root)
        return parent.get(x, x)

    def union(a: str, b: str) -> None:
        ra = find(a)
        rb = find(b)
        if ra != rb:
            parent[rb] = ra

    for items in buckets.values():
        if len(items) < 2:
            continue
        if len(items) > 200:
            continue
        for i in range(len(items)):
            trace_a, text_a = items[i]
            for j in range(i + 1, len(items)):
                trace_b, text_b = items[j]
                if abs(len(text_a) - len(text_b)) > 20:
                    continue
                dist = levenshtein_distance(text_a, text_b)
                ratio = dist / max(len(text_a), len(text_b))
                if ratio <= 0.2:
                    union(trace_a, trace_b)

    groups: Dict[str, int] = {}
    group_id = 1
    clusters: Dict[str, List[str]] = defaultdict(list)
    for trace_id, _ in candidates:
        root = find(trace_id)
        clusters[root].append(trace_id)
    for trace_ids in clusters.values():
        if len(trace_ids) < 2:
            continue
        for trace_id in trace_ids:
            groups[trace_id] = group_id
        group_id += 1
    return groups
Copilot AI Feb 13, 2026

This function contains several magic numbers that should be extracted as named constants to improve code clarity:

  • 12 (line 126): minimum normalized text length for near-duplicate detection
  • 20 (line 134): length bucket divisor for grouping similar-length texts
  • 5 (line 135): prefix length for bucketing
  • 200 (line 155): maximum bucket size before skipping
  • 20 (line 161): maximum length difference threshold
  • 0.2 (line 165): maximum distance ratio for near-duplicates

These magic numbers represent important thresholds for the near-duplicate detection algorithm. Extracting them as named constants would make the code more maintainable and the algorithm's behavior more transparent.

Comment on lines +41 to +54
response = requests.get(
    f"{base_url.rstrip('/')}/api/public/media/{media_id}",
    auth=(public_key, secret_key),
    timeout=60,
)
if response.status_code == 429:
    retry_after = response.headers.get("Retry-After")
    wait_seconds = float(retry_after) if retry_after else delay
    time.sleep(wait_seconds)
    delay = min(delay * 2, 30)
    continue
response.raise_for_status()
payload = response.json()
return payload["url"]
Copilot AI Feb 13, 2026

The fetch_media_url function can raise RuntimeError after exhausting retries, but it doesn't handle other HTTP errors (non-429 status codes) that might occur during the retry loop. If a 500 error or network error occurs, response.raise_for_status() will raise an exception that terminates the retry loop immediately, even though retrying might succeed.

Consider wrapping response.raise_for_status() in a try-except block that catches requests.HTTPError and requests.RequestException, retrying on transient errors while only raising on final failure.

Suggested change
- response = requests.get(
-     f"{base_url.rstrip('/')}/api/public/media/{media_id}",
-     auth=(public_key, secret_key),
-     timeout=60,
- )
- if response.status_code == 429:
-     retry_after = response.headers.get("Retry-After")
-     wait_seconds = float(retry_after) if retry_after else delay
-     time.sleep(wait_seconds)
-     delay = min(delay * 2, 30)
-     continue
- response.raise_for_status()
- payload = response.json()
- return payload["url"]
+ try:
+     response = requests.get(
+         f"{base_url.rstrip('/')}/api/public/media/{media_id}",
+         auth=(public_key, secret_key),
+         timeout=60,
+     )
+     if response.status_code == 429:
+         retry_after = response.headers.get("Retry-After")
+         wait_seconds = float(retry_after) if retry_after else delay
+         time.sleep(wait_seconds)
+         delay = min(delay * 2, 30)
+         continue
+     response.raise_for_status()
+     payload = response.json()
+     return payload["url"]
+ except (requests.HTTPError, requests.RequestException):
+     if attempt == retries - 1:
+         raise
+     time.sleep(delay)
+     delay = min(delay * 2, 30)
+     continue

Comment on lines +108 to +133
def create_dataset_item(
    base_url: str,
    auth: Tuple[str, str],
    payload: Dict[str, Any],
    retries: int = 5,
) -> None:
    delay = 0.5
    for attempt in range(retries):
        response = requests.post(
            f"{base_url.rstrip('/')}/api/public/dataset-items",
            auth=auth,
            json=payload,
            timeout=30,
        )
        if response.status_code == 429:
            retry_after = response.headers.get("Retry-After")
            wait_seconds = float(retry_after) if retry_after else delay
            time.sleep(wait_seconds)
            delay = min(delay * 2, 10)
            continue
        if response.status_code in (200, 201):
            return
        if response.status_code in (400, 409):
            return
        response.raise_for_status()
    raise RuntimeError("rate_limited dataset-items")
Copilot AI Feb 13, 2026

Similar to fetch_media_url in compute_audio_volume.py, the create_dataset_item function can raise exceptions from response.raise_for_status() that terminate the retry loop prematurely. If a transient 5xx error or network error occurs, the function raises immediately instead of retrying.

Consider wrapping response.raise_for_status() in a try-except block to handle transient errors gracefully and only fail after exhausting all retries.
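
A minimal sketch of a retry helper that both scripts could share, under stated assumptions: the name `call_with_retry`, the set of retryable statuses, and the injectable `send`/`sleep` callables are all hypothetical. The request is passed in as a callable so the loop can be exercised without network access; in practice `send` might wrap `requests.post(...)`, whose exceptions derive from `OSError`.

```python
import time
from typing import Any, Callable, Optional

TRANSIENT_STATUSES = frozenset({429, 500, 502, 503, 504})  # assumed retryable set


def call_with_retry(
    send: Callable[[], Any],
    retries: int = 5,
    initial_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-transient response.

    send() should return an object with `status_code` and `headers`
    (for example a requests.Response) or raise OSError on network failure.
    """
    delay = initial_delay
    last_error: Optional[BaseException] = None
    for attempt in range(retries):
        wait = delay
        try:
            response = send()
        except OSError as exc:  # network-level failure: retry
            last_error = exc
        else:
            if response.status_code not in TRANSIENT_STATUSES:
                return response  # caller decides how to treat 2xx and 4xx
            last_error = RuntimeError(f"transient HTTP {response.status_code}")
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                try:
                    wait = float(retry_after)
                except ValueError:
                    pass  # Retry-After may be an HTTP date; keep the backoff delay
        if attempt < retries - 1:
            sleep(wait)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

Returning non-transient responses to the caller keeps the decision about 400 and 409 handling separate from the retry policy.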

Comment on lines +90 to +101
if len(a) < len(b):
    a, b = b, a
previous = list(range(len(b) + 1))
for i, ca in enumerate(a, start=1):
    current = [i]
    for j, cb in enumerate(b, start=1):
        insert_cost = current[j - 1] + 1
        delete_cost = previous[j] + 1
        replace_cost = previous[j - 1] + (0 if ca == cb else 1)
        current.append(min(insert_cost, delete_cost, replace_cost))
    previous = current
return previous[-1]
Copilot AI Feb 13, 2026

The normalize_text and levenshtein_distance functions are duplicated in both run_audit.py and align_with_full_transcript.py. This code duplication violates the DRY principle and creates maintenance burden.

Consider extracting these shared functions into a common utility module that both scripts can import from. This ensures consistent text normalization and distance calculation logic across different parts of the audit tooling.

Suggested change
- if len(a) < len(b):
-     a, b = b, a
- previous = list(range(len(b) + 1))
- for i, ca in enumerate(a, start=1):
-     current = [i]
-     for j, cb in enumerate(b, start=1):
-         insert_cost = current[j - 1] + 1
-         delete_cost = previous[j] + 1
-         replace_cost = previous[j - 1] + (0 if ca == cb else 1)
-         current.append(min(insert_cost, delete_cost, replace_cost))
-     previous = current
- return previous[-1]
+ len_a = len(a)
+ len_b = len(b)
+ dp = [[0] * (len_b + 1) for _ in range(len_a + 1)]
+ for i in range(len_a + 1):
+     dp[i][0] = i
+ for j in range(len_b + 1):
+     dp[0][j] = j
+ for i in range(1, len_a + 1):
+     ca = a[i - 1]
+     for j in range(1, len_b + 1):
+         cb = b[j - 1]
+         cost = 0 if ca == cb else 1
+         dp[i][j] = min(
+             dp[i - 1][j] + 1,
+             dp[i][j - 1] + 1,
+             dp[i - 1][j - 1] + cost,
+         )
+ return dp[len_a][len_b]

Comment on lines +238 to +257
output_csv = meeting_dir / "transcriptions_classified_with_audio.csv"
fields = list(records[0].keys()) if records else []
if records:
    with output_csv.open("w", encoding="utf-8", newline="") as handle:
        handle.write(",".join(fields) + "\n")
        for record in records:
            row = []
            for field in fields:
                value = record.get(field)
                if isinstance(value, list):
                    value = "|".join(str(item) for item in value)
                elif isinstance(value, dict):
                    value = json.dumps(value)
                elif value is None:
                    value = ""
                text = str(value)
                if "," in text or "\n" in text or '"' in text:
                    text = '"' + text.replace('"', '""') + '"'
                row.append(text)
            handle.write(",".join(row) + "\n")
Copilot AI Feb 13, 2026

This script uses manual CSV writing (lines 242-257) instead of the more robust csv.DictWriter approach that is properly used in align_with_full_transcript.py. The PR description mentions "uses csv.DictWriter with full field union for robust CSV output" as a hardening improvement, but this script wasn't updated.

Manual CSV writing is more error-prone and harder to maintain. The implementation in align_with_full_transcript.py (lines 190-197) demonstrates the better approach using csv.DictWriter with proper field handling. Consider applying the same pattern here for consistency and robustness.
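
For reference, the csv.DictWriter pattern with a full field union might look like the sketch below. It is not the code from align_with_full_transcript.py, which is not quoted in this thread; the helper names `serialize_cell` and `write_records_csv` are hypothetical, and the list and dict serialization mirrors the manual writer shown above.

```python
import csv
import json
from typing import Any, Dict, List, TextIO


def serialize_cell(value: Any) -> str:
    """Flatten a record value to text the same way the manual writer does."""
    if value is None:
        return ""
    if isinstance(value, list):
        return "|".join(str(item) for item in value)
    if isinstance(value, dict):
        return json.dumps(value)
    return str(value)


def write_records_csv(records: List[Dict[str, Any]], handle: TextIO) -> None:
    """Write records as CSV, using the union of all keys as the header."""
    if not records:
        return
    # Union of keys in first-seen order, so fields that only appear in
    # later records are not dropped.
    fields: List[str] = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # The csv module handles quoting of commas, quotes, and newlines.
        writer.writerow({key: serialize_cell(record.get(key)) for key in fields})
```

A caller writing to disk would open the file with `newline=""`, as the existing code already does, and pass the handle in.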

@github-actions

Visual regression report

No visual diffs detected.

Run: https://github.com/Chronote-gg/Chronote/actions/runs/21976613954

@BASIC-BIT BASIC-BIT marked this pull request as draft February 14, 2026 21:36