Phases 1-2: Never-reject intake + real document support (PDF/Office/Excel/email) #12
Merged
Phases 1-2 of the Reve Intelligence overhaul. Closes #2, closes #3. (These two phases are coupled: best-effort never-reject is realized by the parser router, so they ship as one coherent change.)

Never hard-reject (#2):
- Remove the extension whitelist; keep only operational limits (empty + oversize).
- Drop the terminal "Unsupported" state and DocumentSupportPolicy. Unknown / low-confidence files are ingested as reviewable records that surface why they are uncertain, instead of being quarantined.
- Raw export works for any record.

Real document support (#3):
- ParserRouter (IDocumentParser): the first parser that claims the file by extension handles it; on no-match or failure the router falls back to best-effort visible text. Never throws except on cancellation.
- Typed managed parsers: Text/Markdown, CSV, PDF (PdfPig; flags scanned PDFs for OCR), Word .docx + PowerPoint .pptx (Open XML SDK), Excel .xlsx (ClosedXML), email .eml (MimeKit) and Outlook .msg (MsgReader), with attachments parsed recursively through the same router. Friendly table names.
- Remove the Python docling-worker from the default path.

Tests: parser router (Excel, email-attachment recursion, binary fallback) and the updated intake contract.
Build 0 warnings; unit 5/5 and integration 5/5; format clean.
Verified in-app: a gibberish .xyz file and a real .eml both become reviewable records with honest low/extracted confidence and working export.
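The routing rule described above (first parser that claims the file by extension, fallback to visible text on no-match or failure, no exceptions other than cancellation) can be illustrated with a minimal sketch. It is written in Python purely for illustration; the real project is .NET, and every name here (`ParseResult`, `TextParser`, `visible_text`, the `claims`/`parse` methods, the `"fallback"` profile) is a hypothetical stand-in, not the PR's actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ParseResult:
    profile: str                      # which parser produced the text
    text: str
    notes: list[str] = field(default_factory=list)  # why the result is uncertain


class DocumentParser(Protocol):
    profile: str
    def claims(self, filename: str) -> bool: ...
    def parse(self, filename: str, data: bytes) -> ParseResult: ...


class TextParser:
    """Hypothetical typed parser for plain text and Markdown."""
    profile = "text"
    extensions = (".txt", ".md")

    def claims(self, filename: str) -> bool:
        return filename.lower().endswith(self.extensions)

    def parse(self, filename: str, data: bytes) -> ParseResult:
        return ParseResult(self.profile, data.decode("utf-8"))


def visible_text(data: bytes) -> str:
    """Best-effort fallback: keep printable ASCII, collapse everything else."""
    chars = [chr(b) if 32 <= b < 127 or b in (9, 10) else " " for b in data]
    return " ".join("".join(chars).split())


class ParserRouter:
    def __init__(self, parsers: list[DocumentParser]):
        self.parsers = list(parsers)

    def parse(self, filename: str, data: bytes) -> ParseResult:
        notes: list[str] = []
        # Only the first parser that claims the file is tried.
        parser = next((p for p in self.parsers if p.claims(filename)), None)
        if parser is None:
            notes.append("no parser claimed this file")
        else:
            try:
                return parser.parse(filename, data)
            except Exception as exc:
                # `except Exception` lets KeyboardInterrupt and
                # asyncio.CancelledError (both BaseException) propagate.
                notes.append(f"{parser.profile} parser failed: {exc}")
        return ParseResult("fallback", visible_text(data), notes)
```

For example, `ParserRouter([TextParser()]).parse("blob.xyz", b"\x00abc")` returns a `fallback` result with a note explaining that no parser claimed the file, rather than raising. Recursive attachment parsing would amount to an email parser calling back into the same `router.parse` for each attachment; that part is omitted here.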
Phases 1-2 of the Reve Intelligence overhaul. Closes #2, closes #3. Stacked on #10 (base = feat/reve-overhaul).
These two phases are coupled — best-effort "never reject" is only fully realized once the parser router exists — so they ship as one coherent change.
Never hard-reject (#2)
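- Remove the extension whitelist; keep only operational limits (empty + oversize).
- Drop the terminal "Unsupported" state and DocumentSupportPolicy. Unknown / low-confidence files are ingested as reviewable records that explain their uncertainty, rather than being quarantined.

Real document support (#3)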
- ParserRouter (IDocumentParser): picks the first parser that claims a file by extension; on no-match or failure it falls back to best-effort visible text. It never throws except on cancellation, so there is one obvious entry point.

Proof
- dotnet build: 0 warnings; dotnet test: unit 5/5 + integration 5/5; dotnet format --verify-no-changes: clean.
- Verified in-app: a gibberish .xyz file becomes a reviewable record (status Extracted, type Unknown, 26% confidence, export enabled); a real .eml is parsed (profile email-eml), classified as Bordereau, with fields extracted from the body.

Still ahead
Real OCR for scanned PDFs/images (#4 prep), honest per-field confidence + classifier fix (#5), and the reference-grade UI (#7) land in later phases.
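
As a closing illustration of the never-reject rule from #2, here is a minimal sketch of an intake check that refuses a file only for operational reasons (empty or oversize) and otherwise always produces a reviewable record. It is in Python for illustration only; the project itself is .NET. The names `intake`, `IntakeRecord`, `IntakeRejected`, `KNOWN_TYPES`, and the 50 MiB cap are all assumptions, since the PR does not state the real limit or types.

```python
import os
from dataclasses import dataclass, field

MAX_BYTES = 50 * 1024 * 1024  # hypothetical cap; the real limit is not stated

# Illustrative mapping only; the real classifier is not extension-based.
KNOWN_TYPES = {".pdf": "PDF", ".eml": "Email"}


class IntakeRejected(Exception):
    """Raised only for operational limits, never because of file type."""


@dataclass
class IntakeRecord:
    filename: str
    status: str                 # always "Extracted"; no "Unsupported" state
    doc_type: str
    review_reasons: list[str] = field(default_factory=list)


def intake(filename: str, data: bytes) -> IntakeRecord:
    # The only hard rejections: empty and oversize uploads.
    if len(data) == 0:
        raise IntakeRejected("file is empty")
    if len(data) > MAX_BYTES:
        raise IntakeRejected("file exceeds the size limit")

    ext = os.path.splitext(filename)[1].lower()
    doc_type = KNOWN_TYPES.get(ext, "Unknown")
    reasons: list[str] = []
    if doc_type == "Unknown":
        # Unknown files are still ingested; the record says why it is uncertain.
        reasons.append(
            f"unrecognized extension {ext or '(none)'}; text extracted best-effort"
        )
    return IntakeRecord(filename, "Extracted", doc_type, reasons)
```

Under these assumptions, `intake("gibberish.xyz", b"\x00\x01")` yields a record with status `Extracted`, type `Unknown`, and one review reason, which mirrors the in-app behaviour reported under Proof.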