feat(control): extract text from PDF attachments #3618

Merged: esengine merged 2 commits into main-v2 from codex/pdf-attachment-extraction on Jun 10, 2026

@SivanCola SivanCola (Collaborator) commented Jun 9, 2026

Summary

  • extract text from referenced PDF attachments before injecting file context
  • prefer pdftotext, then fall back to common Python PDF libraries when available
  • bound PDF extractor stdout/stderr output and apply kill-tree timeout handling for subprocesses
  • preserve prompt-cache stability by keeping extracted PDF text in the per-turn dynamic reference block, not in stable prompts or tool schemas
  • return an OCR/multimodal guidance note when text extraction is unavailable or the PDF appears scanned/image-only
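The extraction chain described above can be sketched roughly as follows. This is a minimal illustration, not the code from `internal/control/refs.go`: the names `cappedWriter`, `extractor`, `runCapped`, `extractText`, and `errNoText` are hypothetical, the 256 KiB cap and 20 second timeout are arbitrary, and the Python fallback assumes the `pypdf` package is installed. It also relies on plain `exec.CommandContext`, which kills only the direct child on timeout rather than the whole process tree.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// cappedWriter keeps at most limit bytes and drops the rest, so a noisy
// extractor cannot grow memory without bound. (Hypothetical helper.)
type cappedWriter struct {
	buf       bytes.Buffer
	limit     int
	truncated bool
}

func (w *cappedWriter) Write(p []byte) (int, error) {
	if room := w.limit - w.buf.Len(); room < len(p) {
		w.truncated = true
		if room > 0 {
			w.buf.Write(p[:room])
		}
	} else {
		w.buf.Write(p)
	}
	// Report the full length so the copy from the child never fails with a short write.
	return len(p), nil
}

// extractor is one candidate command line; args go straight to exec, no shell.
type extractor struct {
	name string
	args []string
}

var errNoText = errors.New("no extractable text")

// runCapped runs one extractor with bounded stdout and stderr.
func runCapped(ctx context.Context, e extractor, limit int) (string, error) {
	stdout := &cappedWriter{limit: limit}
	stderr := &cappedWriter{limit: limit}
	cmd := exec.CommandContext(ctx, e.name, e.args...)
	cmd.Stdout, cmd.Stderr = stdout, stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("%s: %w (stderr: %s)", e.name, err, strings.TrimSpace(stderr.buf.String()))
	}
	return stdout.buf.String(), nil
}

// extractText tries each installed extractor in order and returns the first non-empty result.
func extractText(ctx context.Context, candidates []extractor, limit int) (string, error) {
	for _, e := range candidates {
		if _, err := exec.LookPath(e.name); err != nil {
			continue // tool not installed, try the next one
		}
		if text, err := runCapped(ctx, e, limit); err == nil && strings.TrimSpace(text) != "" {
			return text, nil
		}
	}
	return "", errNoText
}

// pyScript is an assumed fallback using pypdf; the PR does not name the libraries it tries.
const pyScript = `import sys
from pypdf import PdfReader
print("\n".join((p.extract_text() or "") for p in PdfReader(sys.argv[1]).pages))`

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	path := "example.pdf" // hypothetical input file
	candidates := []extractor{
		{name: "pdftotext", args: []string{"-layout", path, "-"}}, // "-" sends text to stdout
		{name: "python3", args: []string{"-c", pyScript, path}},
	}
	text, err := extractText(ctx, candidates, 256<<10)
	if errors.Is(err, errNoText) {
		fmt.Println("No text layer found; the PDF may be scanned. Consider OCR or a vision-capable model.")
		return
	}
	fmt.Printf("extracted %d bytes\n", len(text))
}
```

In a design like this, the returned text would be placed only in the per-turn reference block, so the stable system prompt and tool schemas stay byte-identical across turns and remain cacheable.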

Testing

  • go test ./internal/control -run 'TestReadFileRef|TestRunPDFTextCommandCapsStderr|TestResolveRefsAttachmentKinds|TestFileRefLine|TestParseRefTokens|TestClassifyRef'
  • HOME-isolated go test ./internal/control
  • go test ./internal/proc
  • git diff --check

@SivanCola SivanCola requested a review from esengine as a code owner June 9, 2026 02:39
@github-actions github-actions Bot added the v2 and agent labels and removed the v2 label Jun 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b2a69aa89


Comment thread internal/control/refs.go Outdated
Comment thread internal/control/refs.go
@github-actions github-actions Bot added the v2 label Jun 9, 2026

@esengine esengine (Owner) left a comment

Solid — @-referencing a PDF now extracts its text (pdftotext, then a Python fallback), with a context-timeout + process-tree kill + bounded buffers, args passed directly to exec (no shell), and a clean message pointing at OCR/vision for scanned PDFs. Testable seam + refs_test.go. Thanks!
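The timeout and process-tree kill mentioned in this review could look something like the sketch below. It is an assumption-laden illustration, not the helper from `internal/proc`: `runWithTreeKill` is a hypothetical name, the approach is Unix-only (it relies on `Setpgid` and signalling a negative PID), and it needs Go 1.20 or later for `Cmd.Cancel` and `Cmd.WaitDelay`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runWithTreeKill runs name with args directly (no shell interpolation) and,
// if the timeout expires, kills the whole process group instead of only the
// direct child. Unix-only sketch.
func runWithTreeKill(timeout time.Duration, name string, args ...string) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	// Put the child in its own process group so grandchildren can be signalled with it.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	// On context expiry, send SIGKILL to the group (negative PID) rather than one process.
	cmd.Cancel = func() error {
		return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	}
	// Stop waiting on inherited pipes shortly after the kill.
	cmd.WaitDelay = time.Second

	err = cmd.Run()
	return errors.Is(ctx.Err(), context.DeadlineExceeded), err
}

func main() {
	// "sleep 5" stands in for a hung PDF extractor.
	timedOut, err := runWithTreeKill(500*time.Millisecond, "sleep", "5")
	if timedOut {
		fmt.Println("extractor timed out and its process group was killed:", err)
		return
	}
	fmt.Println("finished:", err)
}
```

Killing the group matters for the Python fallback in particular, since an interpreter may spawn helpers that would otherwise outlive the timeout. On Windows a different mechanism would be needed, such as job objects.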

@esengine esengine merged commit 5f02fd7 into main-v2 Jun 10, 2026
10 checks passed
@esengine esengine deleted the codex/pdf-attachment-extraction branch June 10, 2026 05:04

Labels

  • agent: Core agent loop (internal/agent, internal/control)
  • v2: Go rewrite (1.x), main-v2 branch, active development
