[AGENT] Add hallucination audit artifacts and docs#116

Closed
BASIC-BIT wants to merge 2 commits into master from analysis-hallucination-audit-20260210

@BASIC-BIT
Collaborator

[AGENT] Adds the 2026-02-10 hallucination audit artifacts, reproducibility notes, and documentation for meeting 3837e4e0-64e9-44ba-b5de-c3a6849832d6.

What this includes

  • Hallucination audit datasets and reports under analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6
  • Audit scripts and reproducibility docs, including analysis/hallucination-audit/README.md
  • Project-level doc update in docs/hallucination-audit-20260210.md
  • Git LFS tracking updates for large binary audit artifacts

Context

Preserve reproducible traces, audio, and summaries for future analysis.

Copilot AI review requested due to automatic review settings on February 11, 2026, 17:58
@greptile-apps

greptile-apps Bot commented Feb 11, 2026

Too many files changed for review. (1381 files found, 500 file limit)

Copilot AI left a comment

Pull request overview

Adds a reproducible hallucination-audit workspace for meeting 3837e4e0-64e9-44ba-b5de-c3a6849832d6, including datasets, reports, and Git LFS tracking for large artifacts.

Changes:

  • Added hallucination audit artifacts (reports, summaries, transcripts, datasets, audio clips) for the referenced meeting.
  • Added reproducibility script align_with_full_transcript.py and workspace README.
  • Added .gitattributes Git LFS rules for large audit artifacts (audio + raw traces).

Reviewed changes

Copilot reviewed 285 out of 1381 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| analysis/hallucination-audit/align_with_full_transcript.py | Adds a script to align snippet transcriptions against a full-meeting transcript and emit enriched JSON/CSV outputs. |
| analysis/hallucination-audit/README.md | Documents the audit workspace layout and points to key reports/scripts. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/report.md | Captures audit results/metrics and references generated artifacts for the meeting. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/summary.md | Provides a compact at-a-glance audit summary (counts + duplicate group stats). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_transcript.txt | Stores the full-meeting transcription used for alignment. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_transcript_segments.json | Stores chunk-level transcription segments (source for the combined transcript). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/meeting_history_query.json | Captures the meeting-history query results used during the audit. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/meeting_history_match.json | Stores the matched meeting-history entry used to confirm metadata/notes. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/reason_counts.json | Stores aggregated hallucination/unknown/legit reason counts. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/recording_transcript.json | Records the RecordingTranscript lookup result for the meeting (null here). |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_metadata.json | Captures metadata about the downloaded full-meeting audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_transcription.json | Stores a minimal reproduction transcription result for a specific clip. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_transcription_with_prompt.json | Stores the same reproduction with a prompt, to compare behavior. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_from_full_audio_transcriptions.json | Stores reproduction transcriptions extracted from the full-meeting audio. |
| .gitattributes | Adds Git LFS tracking patterns for large audit artifacts (raw traces + mp3s). |
| analysis/hallucination-audit/audio_cache/*.mp3 | Adds many Git LFS pointer files for cached snippet audio used in the audit. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/audio_combined.mp3 | Adds Git LFS pointer for the full-meeting combined audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/no-no-no_from_full_audio.mp3 | Adds Git LFS pointer for a reproduction clip extracted from the full audio. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_000.mp3 | Adds Git LFS pointer for full-audio chunk 0 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_001.mp3 | Adds Git LFS pointer for full-audio chunk 1 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_002.mp3 | Adds Git LFS pointer for full-audio chunk 2 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_003.mp3 | Adds Git LFS pointer for full-audio chunk 3 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_004.mp3 | Adds Git LFS pointer for full-audio chunk 4 used in full transcription. |
| analysis/hallucination-audit/3837e4e0-64e9-44ba-b5de-c3a6849832d6/full_audio_segments/segment_005.mp3 | Adds Git LFS pointer for full-audio chunk 5 used in full transcription. |

Comment on lines +54 to +55
    if snippet_text in full_text:
        return 1.0, "substring", None

Copilot AI Feb 11, 2026

best_match() reports a "substring" match but always returns None for the match window. This loses useful location data (start/end token span) even though the match is exact, and downstream outputs (full_transcript_match_window) will be empty for the most reliable case. Consider returning a best-effort (start, end) window for substring matches (e.g., locate the snippet within full_text and map that location back to word indices, or find the snippet_words sequence in full_words).
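
One way to recover a window for exact matches is sketched below, under stated assumptions: the helper name `find_word_window` is hypothetical, `snippet_words` and `full_words` are taken to be the whitespace-split token lists the comment refers to, and the span is assumed to be `(start, end)` with `end` exclusive. The window convention the script actually uses is not shown in this hunk, so the indices may need adjusting.

```python
from typing import List, Optional, Tuple


def find_word_window(
    snippet_words: List[str], full_words: List[str]
) -> Optional[Tuple[int, int]]:
    """Return (start, end) word indices of the first exact occurrence.

    `end` is exclusive (an assumption; adapt to the script's convention).
    Returns None when the word sequence does not occur.
    """
    length = len(snippet_words)
    if length == 0 or length > len(full_words):
        return None
    # Slide a window of the snippet's length across the full transcript.
    for start in range(len(full_words) - length + 1):
        if full_words[start:start + length] == snippet_words:
            return start, start + length
    return None


if __name__ == "__main__":
    full = "well no no no thanks".split()
    print(find_word_window("no no no".split(), full))
```

A character-level substring hit can begin or end in the middle of a word, so this word-level scan can return `None` even when `snippet_text in full_text` is true. In that case the caller would presumably keep reporting `None` for the window, as it does today.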

Comment on lines +136 to +152
    with output_csv.open("w", encoding="utf-8", newline="") as handle:
        handle.write(",".join(fields) + "\n")
        for record in records:
            row = []
            for field in fields:
                value = record.get(field)
                if isinstance(value, list):
                    value = "|".join(str(item) for item in value)
                elif isinstance(value, dict):
                    value = json.dumps(value)
                elif value is None:
                    value = ""
                text = str(value)
                if "," in text or "\n" in text or '"' in text:
                    text = '"' + text.replace('"', '""') + '"'
                row.append(text)
            handle.write(",".join(row) + "\n")

Copilot AI Feb 11, 2026

The CSV writer is hand-rolled, which is easy to get subtly wrong (e.g., handling \r\n, consistent quoting rules, and consistent field sets when records have different keys). Using Python’s csv module (e.g., csv.DictWriter) will make output more robust and simpler to maintain; you can also compute fieldnames as the union of observed keys to avoid silently dropping fields that appear after records[0].
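
A minimal sketch of that suggestion follows. The helper names `flatten` and `write_records` are hypothetical, and the list/dict/None handling simply mirrors the hunk above; how the script actually derives `fields` is not shown, so the union-of-keys step is an assumption about the intended behavior.

```python
import csv
import io
import json


def flatten(value):
    """Mirror the hunk's conversions: lists pipe-joined, dicts as JSON, None as empty."""
    if isinstance(value, list):
        return "|".join(str(item) for item in value)
    if isinstance(value, dict):
        return json.dumps(value)
    if value is None:
        return ""
    return value


def write_records(records, handle):
    """Write records as CSV and return the header that was used."""
    # Union of keys in first-seen order, so late-appearing fields are kept.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Keys missing from a record are filled with restval ("").
        writer.writerow({key: flatten(value) for key, value in record.items()})
    return fieldnames


if __name__ == "__main__":
    buffer = io.StringIO()
    write_records(
        [{"id": 1, "text": 'a, "b"'}, {"id": 2, "tags": ["x", "y"]}], buffer
    )
    print(buffer.getvalue())
```

When writing to a real file, keep `newline=""` on the `open()` call as the original code does, so the `csv` module controls line endings itself.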

Comment on lines +75 to +77
dist = levenshtein_distance(snippet_text, window_text)
ratio = dist / max(len(snippet_text), len(window_text))
score = 1 - ratio

Copilot AI Feb 11, 2026

This computes full Levenshtein distances on raw character strings for each candidate window, which can become expensive (O(n*m) per comparison) as snippet/window lengths grow. Since this is an analysis script and may be run over large trace sets, consider adding an early-exit/bounded-distance optimization (e.g., stop computing once the minimum possible score can’t beat best_score), or switch to a cheaper similarity heuristic before running Levenshtein (e.g., token overlap / difflib.SequenceMatcher().quick_ratio() as a gate).
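
The gate idea could look roughly like the sketch below. `best_window` is a hypothetical stand-in for the scoring loop, and `levenshtein_distance` is reimplemented here only so the example is self-contained; the script's own version is not shown. `quick_ratio()` is based on shared character counts, which also bound the normalized Levenshtein score from above, so skipping windows whose `quick_ratio()` cannot beat the current best should not change the result. That is worth confirming against the ungated loop on real traces before relying on it.

```python
from difflib import SequenceMatcher


def levenshtein_distance(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def best_window(snippet_text, windows):
    """Return (best_score, best_index) over candidate window strings."""
    best_score, best_index = 0.0, None
    for index, window_text in enumerate(windows):
        longest = max(len(snippet_text), len(window_text))
        if longest == 0:
            continue  # avoid dividing by zero on two empty strings
        # Cheap upper bound: skip windows that cannot beat the current best.
        gate = SequenceMatcher(None, snippet_text, window_text).quick_ratio()
        if gate <= best_score:
            continue
        score = 1 - levenshtein_distance(snippet_text, window_text) / longest
        if score > best_score:
            best_score, best_index = score, index
    return best_score, best_index


if __name__ == "__main__":
    print(best_window("no no no", ["yes yes yes", "no no now", "completely different"]))
```

The gate pays off most when a good match is found early, since every later window is then rejected after a linear-time character count instead of a quadratic distance computation.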


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60f4a514de

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +121 to +125
    policy === "anyone_in_channel" ||
    (policy === "trigger_or_admin" &&
      meeting.startTriggeredByUserId === invokerId);

  if (!(admin || soloNonBot || allowedByPolicy)) {

P2: Honor trigger_or_admin dismiss policy strictly

The permission gate in handleDismissAutoRecord still allows soloNonBot regardless of the configured policy, so trigger_or_admin does not actually restrict dismissal to the triggering user or admins. In practice, if the trigger user leaves and a different non-admin participant is the only person left, that user can still stop recording even though the policy label says “Trigger or admin,” which makes the config behavior misleading.

Comment on lines +195 to +198
  if (meeting.cancelled) {
    await runMeetingEndStep(meeting, "auto-record-cancel-flow", () =>
      handleAutoRecordCancellation(meeting, chatLogFilePath),
    );

P2: Keep low-content cancellation metric from manual dismisses

This new meeting.cancelled short-circuit sends explicit user dismissals through the same cancellation flow used for low-content auto-cancels, and that flow increments meeting_cancelled_total (documented as low-content cancellations). Because dismissAutoRecord now sets meeting.cancelled = true, manual stops will inflate that metric and skew monitoring or experiment analysis that depends on auto-cancel rates.

Comment on lines +58 to +60
    if not raw:
        return {"hallucinated": 40, "unknown": 40, "legit": 20}
    parts = raw.split(",")

P2: Respect --sample-size when defaulting dataset class counts

parse_counts ignores the requested total when --counts is omitted and always returns a fixed 40/40/20 split. This means running create_langfuse_dataset_sample.py with a non-default --sample-size silently produces 100 items unless the caller also passes --counts, which can invalidate sampling assumptions for audit datasets.
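
One possible fix is to scale the default 40/40/20 split to the requested total, as in the sketch below. The helper name `default_counts` and the `DEFAULT_WEIGHTS` constant are hypothetical, and largest-remainder rounding is just one reasonable choice for distributing leftover items; the script may prefer a different rule, or to reject a missing `--counts` outright when `--sample-size` is non-default.

```python
# Default class proportions taken from the hunk above (40/40/20).
DEFAULT_WEIGHTS = {"hallucinated": 40, "unknown": 40, "legit": 20}


def default_counts(sample_size: int) -> dict:
    """Scale the default split so the counts sum exactly to sample_size."""
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    total_weight = sum(DEFAULT_WEIGHTS.values())
    counts = {}
    remainders = {}
    for label, weight in DEFAULT_WEIGHTS.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[label], remainders[label] = divmod(sample_size * weight, total_weight)
    # Hand out leftover items to the labels with the largest remainders.
    leftover = sample_size - sum(counts.values())
    for label in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[label] += 1
    return counts


if __name__ == "__main__":
    print(default_counts(100))
    print(default_counts(7))
```

With this in place, `parse_counts` could return `default_counts(sample_size)` in the `if not raw:` branch, assuming the sample size is passed in alongside the raw `--counts` string.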

@BASIC-BIT
Collaborator Author

[AGENT] Superseded by #117. Keeping this PR as an archival snapshot of the full raw audit artifacts and LFS payload, and intentionally not merging it into master.

@BASIC-BIT
Collaborator Author

[AGENT] Closing this artifact-heavy PR in favor of the lightweight learnings PR #117. Branch history remains available for audit reference.

@BASIC-BIT BASIC-BIT closed this Feb 13, 2026