Resolve unresolved upstream March OCR/Ollama items #131

Merged
ivanzud merged 16 commits into main from integrate/upstream-wave-20260308b
Mar 8, 2026

@ivanzud (Owner) commented Mar 8, 2026

This batch handles only upstream items that were still unresolved on ivanzud/paperless-gpt after PR #130.

Merged upstream work:

Local follow-up on top of upstream:

  • implement the core #926 OCR-then-classify chaining path (AUTO_OCR_THEN_CLASSIFY=true)
  • add OLLAMA_THINK and VISION_OLLAMA_THINK to explicitly enable/disable Ollama thinking mode (#931)
  • preserve the OCR-complete tag when chaining classification after OCR
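
A flag that must "explicitly enable/disable" a mode is naturally tri-state: unset (leave the model default alone), true, or false. The following is a minimal sketch of that idea, not the code from this PR; the helper names `parseOptionalBool` and `describe` are hypothetical, and only the environment variable names come from the description above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseOptionalBool interprets a raw environment value as a tri-state flag.
// It returns nil when the value is empty (meaning "do not override the
// default"), or a pointer to the parsed boolean otherwise.
// Hypothetical helper, not taken from the PR.
func parseOptionalBool(raw string) (*bool, error) {
	if raw == "" {
		return nil, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean %q: %w", raw, err)
	}
	return &v, nil
}

// describe reports how the named variable would be interpreted.
func describe(name string) string {
	think, err := parseOptionalBool(os.Getenv(name))
	switch {
	case err != nil:
		return name + ": " + err.Error()
	case think == nil:
		return name + ": unset, keeping the model default"
	case *think:
		return name + ": thinking explicitly enabled"
	default:
		return name + ": thinking explicitly disabled"
	}
}

func main() {
	for _, name := range []string{"OLLAMA_THINK", "VISION_OLLAMA_THINK"} {
		fmt.Println(describe(name))
	}
}
```

Using a pointer keeps "unset" distinct from "false", so an explicit `false` can disable thinking rather than being silently treated as absent.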

Validation:

  • go build .
  • go test ./sanitize
  • go test ./ocr -run "^(TestOllamaThinkingCallOptions_Disabled|TestLLMProvider_WithPrompt|TestRenderOCRPrompt|TestResolveVisionOllamaHost)$"

Known unrelated baseline issues in this checkout:

  • root package tests fail to compile in pre-existing app_llm_test.go / paperless_test.go
  • full go test ./ocr still fails on the existing OpenAI-compatible no-key test

BieggerM and others added 16 commits March 4, 2026 21:51
Introduce sanitize package to strip configured patterns from content
before sending to LLMs. Supports literal string removal via
REMOVE_FROM_CONTENT env var and regex pattern removal via
REMOVE_FROM_CONTENT_REGEX env var.

Apply sanitization to document content processing and OCR prompts
to prevent sensitive information from being sent to external LLM APIs.
Content Sanitization Feature
This helps avoid the task-switching overhead caused by potentially long model startup times.
Red phase: tests for SetPrompt/GetPrompt on LLMProvider and
renderOCRPrompt function that renders OCR templates per-document
with existing document content.

Ref: icereed#882
Moves OCR template rendering from startup-only to per-document,
following the same pattern used by title, tag, correspondent, and
all other prompt templates. Enables the OCR prompt to reference the
document's existing text via {{.Content}}.

Changes:
- Add renderOCRPrompt() for per-document template rendering
- Add SetPrompt/GetPrompt on LLMProvider for per-document overrides
- Add ExistingContent to OCROptions, passed from background processor
- Update default ocr_prompt.tmpl with conditional {{.Content}} block
- Startup rendering passes Content="" as fallback (backward compatible)

Templates without {{.Content}} work identically to before. Users with
custom prompts opt in by adding {{if .Content}}...{{end}} to their
template.

Ref: icereed#882
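
The per-document rendering with an opt-in `{{if .Content}}` block can be sketched with Go's `text/template`. This is a minimal sketch, not the PR's `renderOCRPrompt`: the `renderPrompt` function, the `ocrPromptData` struct, its `Language` field, and the template text are all assumptions made for illustration; only the `{{.Content}}` field name comes from the commit message.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// ocrPromptData holds the values exposed to the template.
// Content comes from the commit message; Language is an assumed extra field.
type ocrPromptData struct {
	Language string
	Content  string
}

// exampleTemplate is a made-up prompt with a conditional Content block.
const exampleTemplate = `Transcribe this page. Language: {{.Language}}.{{if .Content}}
Existing text for reference:
{{.Content}}{{end}}`

// renderPrompt parses and executes the template for one document.
func renderPrompt(tmplText string, data ocrPromptData) (string, error) {
	tmpl, err := template.New("ocr").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Empty Content mirrors the startup fallback: the conditional block is skipped.
	startup, _ := renderPrompt(exampleTemplate, ocrPromptData{Language: "English"})
	fmt.Println(startup)

	// Per-document rendering injects the document's existing text.
	perDoc, _ := renderPrompt(exampleTemplate, ocrPromptData{Language: "English", Content: "Invoice 2026-03"})
	fmt.Println(perDoc)
}
```

Because an empty string is falsy in `{{if}}`, a template rendered with `Content=""` produces the same prompt as one that never mentions `{{.Content}}`, which is what makes the change backward compatible.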
- Replace SetPrompt with WithPrompt to avoid mutating the shared
  ocrProvider singleton (concurrency safety)
- Cap existing content to 8000 chars before injecting into OCR prompt
  to avoid blowing vision model context on long documents
- Tighten language test assertion with t.Setenv and assert.Equal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
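
The two safety changes in this commit, returning a copy instead of mutating a shared provider and capping injected content, can be sketched as below. The `provider` type and `truncateRunes` helper are hypothetical stand-ins, not the PR's `LLMProvider`. The source says "8000 chars" without specifying bytes or runes, so counting runes here is an assumption.

```go
package main

import "fmt"

// maxExistingContentChars mirrors the 8000-character cap from the commit message.
const maxExistingContentChars = 8000

// provider is a stand-in for a shared OCR provider (hypothetical type).
type provider struct {
	model  string
	prompt string
}

// WithPrompt returns a shallow copy carrying the new prompt.
// The receiver is left untouched, so concurrent users of the shared
// instance never observe another document's prompt.
func (p *provider) WithPrompt(prompt string) *provider {
	clone := *p
	clone.prompt = prompt
	return &clone
}

// truncateRunes caps s at max runes so a multi-byte character is never split.
func truncateRunes(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

func main() {
	shared := &provider{model: "vision-model", prompt: "default prompt"}

	existing := truncateRunes("existing document text", maxExistingContentChars)
	perDoc := shared.WithPrompt("default prompt\n" + existing)

	fmt.Println(shared.prompt == "default prompt") // shared instance is unchanged
	fmt.Println(len(perDoc.prompt) > len(shared.prompt))
}
```

The copy-on-override approach trades a small allocation per document for freedom from locking around the shared instance.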
…m debug log

- Remove empty-string guard so intentionally-empty templates take effect
- Log prompt length instead of full content in debug output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ivanzud merged commit b3928fe into main Mar 8, 2026
0 of 12 checks passed
@ivanzud deleted the integrate/upstream-wave-20260308b branch March 8, 2026 17:19

4 participants