[AGENT] Add hallucination audit artifacts and docs #116
Preserve reproducible traces, audio, and summaries for future analysis.
Pull request overview
Adds a reproducible hallucination-audit workspace for meeting 3837e4e0-64e9-44ba-b5de-c3a6849832d6, including datasets, reports, and Git LFS tracking for large artifacts.
Changes:
- Added hallucination audit artifacts (reports, summaries, transcripts, datasets, audio clips) for the referenced meeting.
- Added reproducibility script `align_with_full_transcript.py` and workspace README.
- Added `.gitattributes` Git LFS rules for large audit artifacts (audio + raw traces).
Reviewed changes
Copilot reviewed 285 out of 1381 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| analysis/hallucination-audit/align_with_full_transcript.py | Adds a script to align snippet transcriptions against a full-meeting transcript and emit enriched JSON/CSV outputs. |
| analysis/hallucination-audit/README.md | Documents the audit workspace layout and points to key reports/scripts. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/report.md | Captures audit results/metrics and references generated artifacts for the meeting. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/summary.md | Provides a compact at-a-glance audit summary (counts + duplicate group stats). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_transcript.txt | Stores the full-meeting transcription used for alignment. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_transcript_segments.json | Stores chunk-level transcription segments (source for the combined transcript). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/meeting_history_query.json | Captures the meeting-history query results used during the audit. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/meeting_history_match.json | Stores the matched meeting-history entry used to confirm metadata/notes. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/reason_counts.json | Stores aggregated hallucination/unknown/legit reason counts. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/recording_transcript.json | Records the RecordingTranscript lookup result for the meeting (null here). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_metadata.json | Captures metadata about the downloaded full-meeting audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_transcription.json | Stores a minimal reproduction transcription result for a specific clip. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_transcription_with_prompt.json | Stores the same reproduction with a prompt, to compare behavior. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_from_full_audio_transcriptions.json | Stores reproduction transcriptions extracted from the full-meeting audio. |
| .gitattributes | Adds Git LFS tracking patterns for large audit artifacts (raw traces + mp3s). |
| analysis/hallucination-audit/audio_cache/*.mp3 | Adds many Git LFS pointer files for cached snippet audio used in the audit. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/audio_combined.mp3 | Adds Git LFS pointer for the full-meeting combined audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_from_full_audio.mp3 | Adds Git LFS pointer for a reproduction clip extracted from the full audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_000.mp3 | Adds Git LFS pointer for full-audio chunk 0 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_001.mp3 | Adds Git LFS pointer for full-audio chunk 1 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_002.mp3 | Adds Git LFS pointer for full-audio chunk 2 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_003.mp3 | Adds Git LFS pointer for full-audio chunk 3 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_004.mp3 | Adds Git LFS pointer for full-audio chunk 4 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_005.mp3 | Adds Git LFS pointer for full-audio chunk 5 used in full transcription. |
| if snippet_text in full_text: | ||
| return 1.0, "substring", None |
best_match() reports a "substring" match but always returns None for the match window. This loses useful location data (start/end token span) even though the match is exact, and downstream outputs (full_transcript_match_window) will be empty for the most reliable case. Consider returning a best-effort (start, end) window for substring matches (e.g., locate the snippet within full_text and map that location back to word indices, or find the snippet_words sequence in full_words).
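One possible way to recover the window is sketched below. It assumes `snippet_words` and `full_words` are plain lists of already-normalized tokens; the helper name `find_token_window` is hypothetical and not part of the PR. A character-level substring hit does not guarantee a token-level hit (the snippet may start or end mid-word), so the helper can still return `None` and the caller should keep that case.

```python
from typing import List, Optional, Tuple


def find_token_window(
    snippet_words: List[str], full_words: List[str]
) -> Optional[Tuple[int, int]]:
    """Return (start, end) word indices of the first exact occurrence.

    `end` is exclusive. Returns None when the token sequence is not found,
    which can happen even if the raw snippet text is a substring of the
    full text (for example when the match begins in the middle of a word).
    """
    size = len(snippet_words)
    if size == 0 or size > len(full_words):
        return None
    for start in range(len(full_words) - size + 1):
        # Compare one candidate slice of the transcript against the snippet.
        if full_words[start:start + size] == snippet_words:
            return start, start + size
    return None
```

In the substring branch this could replace the `None` in the returned tuple, e.g. `return 1.0, "substring", find_token_window(snippet_words, full_words)`, assuming both word lists are in scope there.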
| with output_csv.open("w", encoding="utf-8", newline="") as handle: | ||
| handle.write(",".join(fields) + "\n") | ||
| for record in records: | ||
| row = [] | ||
| for field in fields: | ||
| value = record.get(field) | ||
| if isinstance(value, list): | ||
| value = "|".join(str(item) for item in value) | ||
| elif isinstance(value, dict): | ||
| value = json.dumps(value) | ||
| elif value is None: | ||
| value = "" | ||
| text = str(value) | ||
| if "," in text or "\n" in text or '"' in text: | ||
| text = '"' + text.replace('"', '""') + '"' | ||
| row.append(text) | ||
| handle.write(",".join(row) + "\n") |
The CSV writer is hand-rolled, which is easy to get subtly wrong (e.g., handling \r\n, consistent quoting rules, and consistent field sets when records have different keys). Using Python’s csv module (e.g., csv.DictWriter) will make output more robust and simpler to maintain; you can also compute fieldnames as the union of observed keys to avoid silently dropping fields that appear after records[0].
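A minimal sketch of that suggestion follows, assuming `records` is a list of dicts as in the excerpt above. The names `flatten` and `write_records` are hypothetical; the list, dict, and `None` handling mirrors the hand-rolled version, while quoting is left to `csv.DictWriter`.

```python
import csv
import json
from typing import Any, Dict, List, TextIO


def flatten(value: Any) -> str:
    """Convert one cell to text using the same rules as the original loop."""
    if isinstance(value, list):
        return "|".join(str(item) for item in value)
    if isinstance(value, dict):
        return json.dumps(value)
    if value is None:
        return ""
    return str(value)


def write_records(records: List[Dict[str, Any]], handle: TextIO) -> None:
    """Write records as CSV, using the union of all keys as the header."""
    # dict.fromkeys keeps first-seen order, so columns stay stable.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Keys missing from a record are filled with restval ("").
        writer.writerow({key: flatten(value) for key, value in record.items()})
```

The file should still be opened with `newline=""`, as the existing code already does, so the `csv` module controls line endings.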
| dist = levenshtein_distance(snippet_text, window_text) | ||
| ratio = dist / max(len(snippet_text), len(window_text)) | ||
| score = 1 - ratio |
This computes full Levenshtein distances on raw character strings for each candidate window, which can become expensive (O(n*m) per comparison) as snippet/window lengths grow. Since this is an analysis script and may be run over large trace sets, consider adding an early-exit/bounded-distance optimization (e.g., stop computing once the minimum possible score can’t beat best_score), or switch to a cheaper similarity heuristic before running Levenshtein (e.g., token overlap / difflib.SequenceMatcher().quick_ratio() as a gate).
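The gating idea could look roughly like the sketch below. `best_window` is a hypothetical stand-in for the scoring loop, and `levenshtein_distance` here is a plain two-row implementation included only to keep the example runnable; the PR's own helper is not shown in the excerpt. `quick_ratio()` is computed from character counts alone and should never be lower than the `1 - dist / max_len` score, so skipping windows whose `quick_ratio()` does not exceed the current best is intended to leave the result unchanged. Treat that as an assumption worth checking against real traces.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def levenshtein_distance(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance (illustrative stand-in)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def best_window(
    snippet_text: str, windows: Iterable[str]
) -> Tuple[float, Optional[int]]:
    """Return (best_score, index of best window), gating on quick_ratio()."""
    best_score = 0.0
    best_index: Optional[int] = None
    for index, window_text in enumerate(windows):
        longest = max(len(snippet_text), len(window_text))
        if longest == 0:
            continue
        # Cheap character-count check before the O(n*m) distance.
        gate = SequenceMatcher(None, snippet_text, window_text).quick_ratio()
        if gate <= best_score:
            continue
        score = 1 - levenshtein_distance(snippet_text, window_text) / longest
        if score > best_score:
            best_score, best_index = score, index
    return best_score, best_index
```

The gate pays off most once a strong candidate has been found early, since later windows with little character overlap are then skipped without running the full distance.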
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60f4a514de
| policy === "anyone_in_channel" || | ||
| (policy === "trigger_or_admin" && | ||
| meeting.startTriggeredByUserId === invokerId); | ||
| if (!(admin || soloNonBot || allowedByPolicy)) { |
Honor trigger_or_admin dismiss policy strictly
The permission gate in handleDismissAutoRecord still allows soloNonBot regardless of the configured policy, so trigger_or_admin does not actually restrict dismissal to the triggering user or admins. In practice, if the trigger user leaves and a different non-admin participant is the only person left, that user can still stop recording even though the policy label says “Trigger or admin,” which makes the config behavior misleading.
| if (meeting.cancelled) { | ||
| await runMeetingEndStep(meeting, "auto-record-cancel-flow", () => | ||
| handleAutoRecordCancellation(meeting, chatLogFilePath), | ||
| ); |
Keep low-content cancellation metric from manual dismisses
This new meeting.cancelled short-circuit sends explicit user dismissals through the same cancellation flow used for low-content auto-cancels, and that flow increments meeting_cancelled_total (documented as low-content cancellations). Because dismissAutoRecord now sets meeting.cancelled = true, manual stops will inflate that metric and skew monitoring or experiment analysis that depends on auto-cancel rates.
| if not raw: | ||
| return {"hallucinated": 40, "unknown": 40, "legit": 20} | ||
| parts = raw.split(",") |
Respect --sample-size when defaulting dataset class counts
parse_counts ignores the requested total when --counts is omitted and always returns a fixed 40/40/20 split. This means running create_langfuse_dataset_sample.py with a non-default --sample-size silently produces 100 items unless the caller also passes --counts, which can invalidate sampling assumptions for audit datasets.
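One way to make the default follow `--sample-size` is to treat 40/40/20 as weights and scale them, as sketched below. `default_counts` and `DEFAULT_WEIGHTS` are hypothetical names, and how the script actually passes the sample size into `parse_counts` is not shown in the excerpt, so the wiring is an assumption. Largest-remainder rounding is used so the counts always sum to the requested total.

```python
from typing import Dict

# Same proportions as the hard-coded default in the excerpt.
DEFAULT_WEIGHTS: Dict[str, int] = {"hallucinated": 40, "unknown": 40, "legit": 20}


def default_counts(sample_size: int) -> Dict[str, int]:
    """Scale the default class split so it sums exactly to sample_size."""
    total_weight = sum(DEFAULT_WEIGHTS.values())
    counts: Dict[str, int] = {}
    remainders: Dict[str, int] = {}
    for label, weight in DEFAULT_WEIGHTS.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[label], remainders[label] = divmod(sample_size * weight, total_weight)
    leftover = sample_size - sum(counts.values())
    # Give the leftover items to the classes with the largest remainders.
    by_remainder = sorted(remainders, key=lambda label: remainders[label], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    return counts
```

With this in place, the `if not raw:` branch could return `default_counts(sample_size)` instead of the fixed dictionary, and a default sample size of 100 would still yield the original 40/40/20 split.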
[AGENT] Superseded by #117. Keeping this PR as an archival snapshot of the full raw audit artifacts and LFS payload, and intentionally not merging it into master.
[AGENT] Closing this artifact-heavy PR in favor of the lightweight learnings PR #117. Branch history remains available for audit reference.
[AGENT] Adds the 2026-02-10 hallucination audit artifacts, reproducibility notes, and documentation for meeting 3837e4e0-64e9-44ba-b5de-c3a6849832d6.
What this includes
- analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6
- analysis/hallucination-audit/README.md
- docs/hallucination-audit-20260210.md
Context