files: add paged read tool for oversized PDFs #234

## Context

PR #210 added native PDF rehydration with a text fallback for PDFs that exceed provider byte budgets (Anthropic 32 MB, OpenAI 50 MB) or are sent to PDF-incapable models. The fallback inlines the first ~200 KB of `unpdf`-extracted text and truncates the rest.

This works for the common case (small or text-dense PDFs) but leaves a gap.

## The gap

For an oversized PDF (>32 MB Anthropic / >50 MB OpenAI, or any size on a non-PDF model), the model gets:

```
--- Attached PDF: report.pdf (fl_xxx, 100 MB) ---
<first ~200 KB of extracted text>
[... truncated at 200 KB]
```

User asks "look at page 50":
- If page 50 is past the 200 KB cutoff → the model has nothing.
- If page 50 is charts/diagrams (no extractable text) → the model has nothing, even on the native path under the budget.
- The truncation suffix intentionally omits the `files__read` hint to avoid base64 replay loops, so the model isn't even nudged toward an escape valve.

`files__read` exists as a tool but returns the entire file as base64. Calling it on a 100 MB PDF replays the whole file (roughly 133 MB once base64-encoded) into the conversation, exactly the loop PR #210 was avoiding.

## Proposed

A paged read tool: `files__read_pdf_pages(id, pages=[50])` (or a range) that returns either extracted text for those specific pages, or a rendered image for visual content. Scope:

- Input: file id + page list or range.
- Output: per-page text via `unpdf`, optionally rasterized image for visual pages.
- Cheap: only loads / extracts the requested pages.
- Discoverable: the rehydrate text-fallback marker can point to this tool by name without triggering base64 replay (because the tool itself is bounded by page count, not file size).
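
To make the "bounded by page count, not file size" property concrete, here is a minimal sketch of how the tool input could be validated before any extraction happens. The type names, the `MAX_PAGES_PER_CALL` cap, and the injected `extractPage` callback are all assumptions for illustration; the actual per-page extraction (via `unpdf` or otherwise) is deliberately left behind that callback.

```typescript
// Hypothetical input shape for files__read_pdf_pages; field names are assumptions.
type PageSpec = number[] | { from: number; to: number };

interface ReadPdfPagesInput {
  id: string;
  pages: PageSpec;
}

interface PageResult {
  page: number;
  text: string;
}

// Assumed cap so one call is bounded by page count, not by file size.
const MAX_PAGES_PER_CALL = 10;

// Expand a list or range into sorted, de-duplicated 1-based page numbers.
// Rejects empty requests, out-of-bounds pages, and requests above the cap.
function normalizePages(
  spec: PageSpec,
  totalPages: number,
  maxPages: number = MAX_PAGES_PER_CALL,
): number[] {
  let raw: number[];
  if (Array.isArray(spec)) {
    raw = spec;
  } else {
    if (!Number.isInteger(spec.from) || !Number.isInteger(spec.to) || spec.from > spec.to) {
      throw new RangeError("invalid page range");
    }
    // Check the span before expanding it, so a huge range never allocates.
    if (spec.to - spec.from + 1 > maxPages) {
      throw new RangeError(`range exceeds limit of ${maxPages} pages`);
    }
    raw = Array.from({ length: spec.to - spec.from + 1 }, (_, i) => spec.from + i);
  }
  const pages = Array.from(new Set(raw)).sort((a, b) => a - b);
  if (pages.length === 0) throw new RangeError("no pages requested");
  if (pages.length > maxPages) {
    throw new RangeError(`requested ${pages.length} pages; limit is ${maxPages}`);
  }
  for (const p of pages) {
    if (!Number.isInteger(p) || p < 1 || p > totalPages) {
      throw new RangeError(`page ${p} is outside 1..${totalPages}`);
    }
  }
  return pages;
}

// The extractor is injected so this sketch does not assume a specific unpdf call.
type ExtractPage = (fileId: string, page: number) => Promise<string>;

async function readPdfPages(
  input: ReadPdfPagesInput,
  totalPages: number,
  extractPage: ExtractPage,
): Promise<PageResult[]> {
  const pages = normalizePages(input.pages, totalPages);
  const results: PageResult[] = [];
  for (const page of pages) {
    results.push({ page, text: await extractPage(input.id, page) });
  }
  return results;
}
```

Validating up front means a malformed or oversized request fails before the PDF is opened, which keeps the tool safe to advertise from the truncation marker.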

## Notes

- PR #210's sidecar caches the *truncated* extraction. A paged tool would not use the sidecar; it would extract on demand, per page. If we want caching here at all, it would need a different, per-page cache shape.
- Rendered-image output requires a PDF→image pipeline (`pdf-img-extract` or similar); `unpdf` is text-only.
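
If per-page caching turns out to be worthwhile, the cache shape from the first note could look roughly like the sketch below. This is an in-memory illustration only: the `PageTextCache` name, the `fileId:page` key format, and the entry limit are assumptions, and it says nothing about how the existing sidecar is stored.

```typescript
// Hypothetical per-page text cache keyed by (file id, page number).
// Uses Map insertion order to evict the least recently used entry.
class PageTextCache {
  private readonly entries = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries: number = 256) {
    this.maxEntries = maxEntries;
  }

  private key(fileId: string, page: number): string {
    return `${fileId}:${page}`;
  }

  get(fileId: string, page: number): string | undefined {
    const k = this.key(fileId, page);
    const hit = this.entries.get(k);
    if (hit !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.entries.delete(k);
      this.entries.set(k, hit);
    }
    return hit;
  }

  set(fileId: string, page: number, text: string): void {
    const k = this.key(fileId, page);
    this.entries.delete(k);
    this.entries.set(k, text);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Keying by page rather than by file means a request for page 50 never forces extraction or storage of pages 1 through 49.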

Related: #189 (the issue PR #210 addressed), #210 (the rehydration PR itself).
