Context
PR #210 added native PDF rehydration with a text fallback for PDFs that exceed provider byte budgets (Anthropic 32 MB, OpenAI 50 MB) or are sent to PDF-incapable models. The fallback inlines the first ~200 KB of unpdf-extracted text and truncates the rest.
This works for the common case (small or text-dense PDFs) but leaves a gap.
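To make the cutoff concrete, here is a minimal sketch of a byte-bounded truncation step. It is not the code from PR #210: the name `truncateUtf8`, the constant `FALLBACK_BYTE_LIMIT`, and the reading of "~200 KB" as 200 * 1024 bytes are all assumptions made for illustration.

```typescript
// Hypothetical limit: the issue says "~200 KB"; 200 * 1024 bytes is an assumption.
const FALLBACK_BYTE_LIMIT = 200 * 1024;

interface TruncatedText {
  text: string;
  truncated: boolean;
}

// Cut extracted text to at most `limit` UTF-8 bytes without splitting a character.
function truncateUtf8(text: string, limit: number = FALLBACK_BYTE_LIMIT): TruncatedText {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= limit) {
    return { text, truncated: false };
  }
  // bytes[end] is the first byte that would be dropped. If it is a UTF-8
  // continuation byte (0b10xxxxxx), the cut would split a character, so back up.
  let end = limit;
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return { text: new TextDecoder().decode(bytes.subarray(0, end)), truncated: true };
}
```

Anything past the returned prefix is simply absent from the conversation, which is why a page late in a long document can be invisible to the model.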
The gap
For an oversized PDF (>32 MB Anthropic / >50 MB OpenAI, or any size on a non-PDF model), the model gets:
--- Attached PDF: report.pdf (fl_xxx, 100 MB) ---
<first ~200 KB of extracted text>
[... truncated at 200 KB]
User asks "look at page 50":
- If page 50 is past the 200 KB cutoff → the model has nothing.
- If page 50 is charts/diagrams (no extractable text) → the model has nothing, even on the native path under the budget.
- The truncation suffix intentionally omits the files__read hint to avoid base64 replay loops, so the model isn't even nudged toward an escape valve.
files__read exists as a tool but returns the entire file as base64. Calling it on a 100 MB PDF replays the whole file into the conversation (roughly 133 MB once base64-encoded), exactly the loop PR #210 was avoiding.
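For a sense of scale, base64 turns every 3 input bytes into 4 output characters, so the replayed payload is larger than the file itself. The helper name `base64Length` and the use of decimal megabytes are assumptions for this illustration.

```typescript
// Base64 encodes every 3 input bytes as 4 output characters, padding the final group.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const MB = 1_000_000; // assumption: decimal megabytes
const encodedChars = base64Length(100 * MB);
console.log((encodedChars / MB).toFixed(1)); // prints "133.3"
```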
Proposed
A paged read tool: files__read_pdf_pages(id, pages=[50]) (or a range) that returns either extracted text for those specific pages, or a rendered image for visual content. Scope:
- Input: file id + page list or range.
- Output: per-page text via unpdf, optionally a rasterized image for visual pages.
- Cheap: only loads / extracts the requested pages.
- Discoverable: the rehydrate text-fallback marker can point to this tool by name without triggering base64 replay (because the tool itself is bounded by page count, not file size).
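The scope above could be sketched roughly as follows. Everything here is hypothetical: `PageRequest`, `resolvePages`, `readPdfPages`, the `MAX_PAGES_PER_CALL` cap of 10, and the injected `PageTextExtractor` are illustrative names, and the extractor stands in for whatever unpdf-based per-page extraction the real tool would use. The point is only that the cost is bounded by page count, not file size.

```typescript
// All names below are hypothetical; none come from PR #210.
type PageRequest = number[] | { from: number; to: number };

const MAX_PAGES_PER_CALL = 10; // assumed cap; the issue does not specify one

// Expand a list or inclusive range into sorted, de-duplicated 1-based page numbers.
function resolvePages(
  request: PageRequest,
  pageCount: number,
  maxPages: number = MAX_PAGES_PER_CALL,
): number[] {
  let requested: number[];
  if (Array.isArray(request)) {
    requested = request;
  } else {
    // Check the span before allocating so a huge range cannot blow up memory.
    const span = request.to - request.from + 1;
    if (!Number.isInteger(span) || span < 1 || span > maxPages) {
      throw new RangeError(`range must cover between 1 and ${maxPages} pages`);
    }
    const from = request.from;
    requested = Array.from({ length: span }, (_, i) => from + i);
  }
  const pages = [...new Set(requested)].sort((a, b) => a - b);
  if (pages.length === 0) throw new RangeError("no pages requested");
  if (pages.length > maxPages) throw new RangeError(`at most ${maxPages} pages per call`);
  for (const page of pages) {
    if (!Number.isInteger(page) || page < 1 || page > pageCount) {
      throw new RangeError(`page ${page} is outside 1..${pageCount}`);
    }
  }
  return pages;
}

interface PageResult {
  page: number;
  text: string;
  hasText: boolean; // false marks a page that may need a rendered image instead
}

// Injected so the sketch stays independent of any real PDF library.
type PageTextExtractor = (fileId: string, page: number) => Promise<string>;

async function readPdfPages(
  fileId: string,
  request: PageRequest,
  pageCount: number,
  extractPage: PageTextExtractor,
): Promise<PageResult[]> {
  const results: PageResult[] = [];
  for (const page of resolvePages(request, pageCount)) {
    const text = (await extractPage(fileId, page)).trim();
    results.push({ page, text, hasText: text.length > 0 });
  }
  return results;
}
```

Validating the request before touching the file keeps a malformed or oversized call cheap, and the `hasText` flag gives a later image pipeline an obvious hook for visual pages.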
Notes
- PR #210's sidecar caches the truncated extraction. A paged tool would not use the sidecar; it would extract on demand per page. If we want caching here at all, it would need a different, per-page cache shape.
- Rendered-image output requires a PDF→image pipeline (pdf-img-extract or similar); unpdf is text-only.
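If per-page caching turns out to be worth having, one possible shape is a small in-memory LRU keyed by file id and page number. This is only a sketch under assumptions: the class name `PageTextCache`, the default of 256 entries, and the in-memory storage are all hypothetical, and nothing here reflects how the existing sidecar is implemented.

```typescript
// Hypothetical per-page cache: keyed by file id and page number, bounded by entry count.
class PageTextCache {
  private readonly entries = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries: number = 256) {
    this.maxEntries = maxEntries;
  }

  private static key(fileId: string, page: number): string {
    return `${fileId}:${page}`;
  }

  get(fileId: string, page: number): string | undefined {
    const key = PageTextCache.key(fileId, page);
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark as most recently used (Map preserves insertion order).
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(fileId: string, page: number, text: string): void {
    const key = PageTextCache.key(fileId, page);
    this.entries.delete(key);
    this.entries.set(key, text);
    if (this.entries.size > this.maxEntries) {
      // The least recently used entry is first in iteration order.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Bounding by entry count rather than by file keeps memory use predictable regardless of how large the source PDF is.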
Related: #189 (the issue PR #210 addressed), #210 (the rehydration PR itself).