Phases 1-2 — Never-reject intake + real document support (PDF/Office/Excel/email) #12

Merged: xt0n1-t3ch merged 1 commit into main from feat/phase-2-parsers on Jun 13, 2026

@xt0n1-t3ch
Owner

Phases 1-2 of the Reve Intelligence overhaul. Closes #2, closes #3. Stacked on #10 (base = feat/reve-overhaul).

These two phases are coupled — best-effort "never reject" is only fully realized once the parser router exists — so they ship as one coherent change.

Never hard-reject (#2)

  • Removed the extension whitelist from intake (only empty + oversize limits remain).
  • Dropped the terminal "Unsupported" state and DocumentSupportPolicy. Unknown / low-confidence files are ingested as reviewable records that explain their uncertainty, rather than being quarantined.
  • Raw export works for any record.
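
The intake rule described above can be sketched as follows. This is an illustrative sketch only, written in Python for brevity; the PR's actual intake code is managed .NET and is not shown here. The names `check_intake`, `IntakeResult`, and the `MAX_BYTES` value are assumptions made for the example, not identifiers or limits taken from the PR.

```python
from dataclasses import dataclass

# Hypothetical limit; the PR does not state the real oversize threshold.
MAX_BYTES = 50 * 1024 * 1024


@dataclass
class IntakeResult:
    accepted: bool
    reason: str = ""


def check_intake(filename: str, size_bytes: int, max_bytes: int = MAX_BYTES) -> IntakeResult:
    """Reject only on operational limits; the file extension is never consulted."""
    if size_bytes == 0:
        return IntakeResult(False, "empty file")
    if size_bytes > max_bytes:
        return IntakeResult(False, "file exceeds size limit")
    # No extension whitelist: unknown types are accepted and left for review.
    return IntakeResult(True)


print(check_intake("gibberish.xyz", 1024))  # accepted despite the unknown extension
print(check_intake("empty.pdf", 0))         # rejected: empty file
```

The point of the design is that `filename` plays no part in the decision, so an unfamiliar extension can never cause a rejection on its own.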

Real document support (#3)

  • ParserRouter (IDocumentParser): picks the first parser that claims a file by extension; on no match or parser failure it falls back to best-effort visible text. It is the single entry point for parsing and never throws except on cancellation.
  • Typed, managed parsers (no Python): Text/Markdown, CSV, PDF (PdfPig text layer; flags scanned PDFs for the upcoming OCR pass), Word .docx + PowerPoint .pptx (Open XML SDK), Excel .xlsx (ClosedXML), email .eml (MimeKit) and Outlook .msg (MsgReader) with attachments parsed recursively through the same router.
  • Removed the Python docling-worker from the default path.
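
The routing behaviour described in this list (first parser to claim the extension wins, fallback on no match or failure, only cancellation propagates) might look roughly like the sketch below. It is written in Python purely for illustration and is not the PR's .NET `ParserRouter`; `ParseCancelled`, `visible_text`, the `(extensions, parser)` registration shape, and the `"typed"`/`"fallback"` labels are all assumptions made for the example.

```python
from pathlib import PurePath
from typing import Callable

Parser = Callable[[bytes], str]


class ParseCancelled(Exception):
    """Hypothetical cancellation signal; the only error the router lets through."""


def visible_text(data: bytes) -> str:
    """Best-effort fallback: keep printable ASCII, tabs and newlines; drop the rest."""
    return bytes(b for b in data if 32 <= b < 127 or b in (9, 10)).decode("ascii")


class ParserRouter:
    def __init__(self, parsers: list[tuple[set[str], Parser]]):
        self._parsers = parsers  # order matters: the first claimant wins

    def parse(self, filename: str, data: bytes) -> tuple[str, str]:
        ext = PurePath(filename).suffix.lower()
        for extensions, parser in self._parsers:
            if ext in extensions:
                try:
                    return parser(data), "typed"
                except ParseCancelled:
                    raise  # cancellation is never swallowed
                except Exception:
                    break  # typed parser failed: use the fallback below
        return visible_text(data), "fallback"


router = ParserRouter([({".txt", ".md"}, lambda data: data.decode("utf-8"))])
print(router.parse("notes.txt", b"hello"))       # handled by the typed parser
print(router.parse("blob.xyz", b"\x00ab\xff"))   # no claimant: fallback text
print(router.parse("bad.txt", b"\xff\xfeok"))    # typed parser fails: fallback text
```

In this sketch, recursive attachment parsing would amount to an email parser calling `router.parse` again for each attachment, which is one plausible reading of "through the same router".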

Proof

  • dotnet build: 0 warnings; dotnet test: unit 5/5 and integration 5/5 passing; dotnet format --verify-no-changes: clean.
  • New tests: Excel parsing, email-attachment recursion, never-throw binary fallback, and the updated intake contract.
  • In-app: a gibberish .xyz file becomes a reviewable record (status Extracted, type Unknown, 26% confidence, export enabled); a real .eml is parsed (profile email-eml), classified Bordereau, fields extracted from the body.

Still ahead

Real OCR for scanned PDFs/images (#4 prep), honest per-field confidence + classifier fix (#5), and the reference-grade UI (#7) land in later phases.

…/Excel/email)

Phases 1-2 of the Reve Intelligence overhaul. Closes #2, closes #3.
(These two phases are coupled — best-effort never-reject is realized by the
parser router — so they ship as one coherent change.)

Never hard-reject (#2):
- Remove the extension whitelist; keep only operational limits (empty + oversize).
- Drop the terminal "Unsupported" state and DocumentSupportPolicy. Unknown /
  low-confidence files are ingested as reviewable records that surface why they
  are uncertain, instead of being quarantined.
- Raw export works for any record.

Real document support (#3):
- ParserRouter (IDocumentParser): first parser that claims the file by extension;
  on no-match or failure it falls back to best-effort visible text. Never throws
  except on cancellation.
- Typed managed parsers: Text/Markdown, CSV, PDF (PdfPig; flags scanned PDFs for
  OCR), Word .docx + PowerPoint .pptx (Open XML SDK), Excel .xlsx (ClosedXML),
  email .eml (MimeKit) and Outlook .msg (MsgReader) with attachments parsed
  recursively through the same router. Friendly table names.
- Remove the Python docling-worker from the default path.

Tests: parser router (Excel, email-attachment recursion, binary fallback) and the
updated intake contract. Build 0 warnings; unit 5/5 and integration 5/5; format clean.
Verified in-app: a gibberish .xyz file and a real .eml both become reviewable records
with honest low/extracted confidence and working export.
xt0n1-t3ch changed the base branch from feat/reve-overhaul to main on June 13, 2026 17:44
xt0n1-t3ch merged commit fab16ef into main on Jun 13, 2026
xt0n1-t3ch deleted the feat/phase-2-parsers branch on June 13, 2026 17:46

Development

Successfully merging this pull request may close these issues.

Phase 2 — Parser router + typed parsers (PDF/Office/Excel/Email/.msg)
Phase 1 — Never hard-reject intake (FileTypeSniffer + BinaryFallbackParser)