files: add paged read tool for oversized PDFs #234

@mgoldsborough

Context

PR #210 added native PDF rehydration with a text fallback for PDFs that exceed provider byte budgets (Anthropic 32 MB, OpenAI 50 MB) or are sent to PDF-incapable models. The fallback inlines the first ~200 KB of unpdf-extracted text and truncates the rest.

This works for the common case (small or text-dense PDFs) but leaves a gap.

The gap

For a PDF that takes the fallback path (over 32 MB on Anthropic, over 50 MB on OpenAI, or any size on a PDF-incapable model), the model gets:

--- Attached PDF: report.pdf (fl_xxx, 100 MB) ---
<first ~200 KB of extracted text>
[... truncated at 200 KB]

Suppose the user then asks "look at page 50":

  • If page 50 is past the 200 KB cutoff → the model has nothing.
  • If page 50 is charts/diagrams (no extractable text) → the model has nothing, even on the native path under the budget.
  • The truncation suffix intentionally omits the files__read hint to avoid base64 replay loops, so the model isn't even nudged toward an escape valve.

files__read exists as a tool, but it returns the entire file as base64. Calling it on a 100 MB PDF replays the whole file (roughly 133 MB once base64-encoded) into the conversation, which is exactly the loop PR #210 was avoiding.
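
For a sense of scale, here is a rough sketch of the base64 arithmetic. The helper name base64Length is made up for illustration; the only fact it relies on is that base64 emits 4 characters for every 3 input bytes (with padding).

```typescript
// Base64 encodes every 3 input bytes as 4 output characters, padding the
// final group, so the encoded length is ceil(n / 3) * 4.
// `base64Length` is a hypothetical helper, not part of the files tool.
function base64Length(byteLength: number): number {
  if (!Number.isInteger(byteLength) || byteLength < 0) {
    throw new RangeError("byteLength must be a non-negative integer");
  }
  return Math.ceil(byteLength / 3) * 4;
}

const MB = 1024 * 1024;
const encodedChars = base64Length(100 * MB);

// A 100 MB file becomes roughly 133 MB of base64 characters.
console.log(`${(encodedChars / MB).toFixed(1)} MB of base64`);
```

So the replay costs about a third more than the raw file size, before any tokenization overhead.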

Proposed

A paged read tool: files__read_pdf_pages(id, pages=[50]) (or a range) that returns either extracted text for those specific pages, or a rendered image for visual content. Scope:

  • Input: file id + page list or range.
  • Output: per-page text via unpdf, optionally rasterized image for visual pages.
  • Cheap: only loads / extracts the requested pages.
  • Discoverable: the rehydrate text-fallback marker can point to this tool by name without triggering base64 replay (because the tool itself is bounded by page count, not file size).
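
A minimal sketch of how the input side could be validated, to make "bounded by page count" concrete. Everything here is an assumption for illustration: the PageSpec shape, the normalizePages name, and the cap of 20 pages per call are not specified in this issue.

```typescript
// Hypothetical input shape: the issue only says "page list or range".
type PageSpec = number[] | { from: number; to: number };

// Assumed cap; the issue does not pick a number.
const MAX_PAGES_PER_CALL = 20;

// Expand an inclusive range, refusing oversized ranges before allocating.
function expandRange(from: number, to: number, maxPages: number): number[] {
  if (!Number.isInteger(from) || !Number.isInteger(to) || from > to) {
    throw new RangeError(`invalid range ${from}..${to}`);
  }
  if (to - from + 1 > maxPages) {
    throw new RangeError(`range exceeds ${maxPages} pages per call`);
  }
  const pages: number[] = [];
  for (let p = from; p <= to; p++) pages.push(p);
  return pages;
}

// Turn a list or range into a sorted, de-duplicated list of 1-based pages.
function normalizePages(
  spec: PageSpec,
  pageCount: number,
  maxPages: number = MAX_PAGES_PER_CALL,
): number[] {
  const requested = Array.isArray(spec)
    ? spec
    : expandRange(spec.from, spec.to, maxPages);
  // De-duplicate and sort so each page is extracted once, in document order.
  const pages = Array.from(new Set(requested)).sort((a, b) => a - b);
  if (pages.length === 0) {
    throw new RangeError("at least one page is required");
  }
  if (pages.length > maxPages) {
    throw new RangeError(`at most ${maxPages} pages per call`);
  }
  for (const p of pages) {
    if (!Number.isInteger(p) || p < 1 || p > pageCount) {
      throw new RangeError(`page ${p} is outside 1..${pageCount}`);
    }
  }
  return pages;
}
```

With a cap like this, the size of a tool result depends on the number of pages requested rather than on the file size, which is what would let the fallback marker name the tool safely.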

Notes

  • PR #210's sidecar caches the truncated extraction. A paged tool would not use the sidecar; it would extract on demand, per page. If we want caching here at all, it needs a different cache shape (per-page).
  • Rendered-image output requires a PDF→image pipeline (pdf-img-extract or similar); unpdf is text-only.
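
If per-page caching turns out to be worthwhile, one possible shape is a map keyed by file id plus page number. This is a sketch only: PageTextCache and the PageExtractor signature are hypothetical, and the extractor is injected so the same cache could sit in front of unpdf text extraction or a rasterizer.

```typescript
// Hypothetical extractor signature; a real one might wrap unpdf (text)
// or a PDF-to-image pipeline (rendered pages).
type PageExtractor = (fileId: string, page: number) => Promise<string>;

// In-memory per-page cache, keyed by "<fileId>:<page>".
class PageTextCache {
  private readonly entries = new Map<string, Promise<string>>();
  private readonly extract: PageExtractor;

  constructor(extract: PageExtractor) {
    this.extract = extract;
  }

  get(fileId: string, page: number): Promise<string> {
    const key = `${fileId}:${page}`;
    let pending = this.entries.get(key);
    if (!pending) {
      // Cache the promise itself so concurrent requests share one extraction.
      pending = this.extract(fileId, page);
      this.entries.set(key, pending);
      // Drop failed extractions so a later call can retry.
      pending.catch(() => {
        this.entries.delete(key);
      });
    }
    return pending;
  }
}
```

Unlike the sidecar, nothing here is tied to the truncated whole-document extraction; eviction and persistence are left open.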

Related: #189 (the issue PR #210 addressed), #210 (the rehydration PR itself).
