Resolve unresolved upstream March OCR/Ollama items by ivanzud · Pull Request #131 · ivanzud/paperless-gpt

ivanzud · 2026-03-08T17:19:43Z

This batch handles only upstream items that were still unresolved on ivanzud/paperless-gpt after PR #130.

Merged upstream work:

icereed/paperless-gpt#917 "Content Sanitization for LLM API Calls" / issue icereed/paperless-gpt#916 "Feature Request: Content Sanitization for LLM API Calls": sanitize LLM API payloads
icereed/paperless-gpt#918 "Run auto-tagging only if no documents found by OCR": skip auto-tagging when OCR found work in the current cycle
icereed/paperless-gpt#925 "feat(ocr): add per-document prompt rendering with existing content support" / issue icereed/paperless-gpt#882 "ocr-prompt should have a {{.Content}} variable with existing OCR": per-document OCR prompt rendering with existing content

Local follow-up on top of upstream:

implement the core #926 OCR-then-classify chaining path (AUTO_OCR_THEN_CLASSIFY=true)
add OLLAMA_THINK and VISION_OLLAMA_THINK to explicitly enable/disable Ollama thinking mode (#931)
preserve the OCR-complete tag when chaining classification after OCR

Validation:

go build .
go test ./sanitize
go test ./ocr -run "^(TestOllamaThinkingCallOptions_Disabled|TestLLMProvider_WithPrompt|TestRenderOCRPrompt|TestResolveVisionOllamaHost)$"

Known unrelated baseline issues in this checkout:

root package tests fail to compile in pre-existing app_llm_test.go / paperless_test.go
full go test ./ocr still fails on the existing OpenAI-compatible no-key test

Introduce sanitize package to strip configured patterns from content before sending to LLMs. Supports literal string removal via REMOVE_FROM_CONTENT env var and regex pattern removal via REMOVE_FROM_CONTENT_REGEX env var. Apply sanitization to document content processing and OCR prompts to prevent sensitive information from being sent to external LLM APIs.
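
To illustrate the idea described above, here is a minimal Go sketch of literal and regex removal. The sanitizer type, its constructor, and the treatment of each variable as a single value are assumptions; the actual sanitize package may parse REMOVE_FROM_CONTENT and REMOVE_FROM_CONTENT_REGEX differently (for example, as lists).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sanitizer holds the configured removals. The type and its fields are
// hypothetical; they are not the PR's actual API.
type sanitizer struct {
	literal string         // assumed: the value of REMOVE_FROM_CONTENT
	pattern *regexp.Regexp // compiled REMOVE_FROM_CONTENT_REGEX, nil if unset
}

// newSanitizer compiles the regex once, so an invalid pattern is reported
// up front instead of being silently skipped on every call.
func newSanitizer(literal, pattern string) (*sanitizer, error) {
	s := &sanitizer{literal: literal}
	if pattern != "" {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, fmt.Errorf("invalid REMOVE_FROM_CONTENT_REGEX: %w", err)
		}
		s.pattern = re
	}
	return s, nil
}

// clean removes the literal first, then every regex match.
func (s *sanitizer) clean(content string) string {
	if s.literal != "" {
		content = strings.ReplaceAll(content, s.literal, "")
	}
	if s.pattern != nil {
		content = s.pattern.ReplaceAllString(content, "")
	}
	return content
}

func main() {
	// In a real deployment these two values would come from the environment.
	s, err := newSanitizer("ACME Corp", `\b\d{3}-\d{2}-\d{4}\b`)
	if err != nil {
		// Fail closed: without a working sanitizer, send nothing.
		fmt.Println("refusing to send content:", err)
		return
	}
	fmt.Printf("%q\n", s.clean("ACME Corp invoice, SSN 123-45-6789."))
}
```

Surfacing the compile error at construction time lets the caller fail closed, in the spirit of the "fail closed on error" commit later in this PR.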

Moves OCR template rendering from startup-only to per-document, following the same pattern used by title, tag, correspondent, and all other prompt templates. Enables the OCR prompt to reference the document's existing text via {{.Content}}.

Changes:
- Add renderOCRPrompt() for per-document template rendering
- Add SetPrompt/GetPrompt on LLMProvider for per-document overrides
- Add ExistingContent to OCROptions, passed from background processor
- Update default ocr_prompt.tmpl with conditional {{.Content}} block
- Startup rendering passes Content="" as fallback (backward compatible)

Templates without {{.Content}} work identically to before. Users with custom prompts opt in by adding {{if .Content}}...{{end}} to their template.

Ref: icereed#882
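
The opt-in behaviour of a conditional {{.Content}} block can be sketched with Go's standard text/template package. The function name renderPrompt, its signature, and the example template text are assumptions for illustration; they are not the PR's renderOCRPrompt() or the shipped ocr_prompt.tmpl.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// examplePrompt is a made-up template showing the conditional block.
const examplePrompt = `Transcribe the text in this image.
{{if .Content}}Existing text from a previous OCR pass, for reference:
{{.Content}}
{{end}}`

// renderPrompt parses and executes the template for one document.
// An empty existingContent makes the {{if .Content}} block disappear.
func renderPrompt(tmplText, existingContent string) (string, error) {
	tmpl, err := template.New("ocr").Parse(tmplText)
	if err != nil {
		return "", err
	}
	data := struct{ Content string }{Content: existingContent}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, content := range []string{"", "Invoice 2024-001"} {
		out, err := renderPrompt(examplePrompt, content)
		if err != nil {
			fmt.Println("render error:", err)
			return
		}
		fmt.Printf("content=%q\n%s---\n", content, out)
	}
}
```

A template that never mentions {{.Content}} renders the same regardless of what is passed in, which is why existing custom prompts keep working unchanged.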

- Replace SetPrompt with WithPrompt to avoid mutating the shared ocrProvider singleton (concurrency safety)
- Cap existing content to 8000 chars before injecting into OCR prompt to avoid blowing vision model context on long documents
- Tighten language test assertion with t.Setenv and assert.Equal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BieggerM and others added 16 commits March 4, 2026 21:51

Fix Docker build: copy sanitize package to build context

770eac1

Fix: sanitize document content in custom fields processing

97f3db8

Fix: sanitize document content in ad-hoc analysis

8d1d95a

Merge pull request #1 from BieggerM/sanitize

1dae1cd

Content Sanitization Feature

Run auto-tagging only if no documents found by OCR

71798bc

This helps avoid the task-switching overhead caused by potentially long model startup times.

fix(sanitize): reset initErr in tests to prevent state leakage

fccd7e9

fix(sanitize): add defensive init check to fail closed on error

c688344

test(ocr): add tests for per-document OCR prompt rendering

f54ee41

Red phase: tests for SetPrompt/GetPrompt on LLMProvider and for the renderOCRPrompt function, which renders OCR templates per-document with existing document content. Ref: icereed#882

fix(ocr): apply rendered prompt unconditionally and redact prompt from debug log

50d2eb4

- Remove empty-string guard so intentionally-empty templates take effect
- Log prompt length instead of full content in debug output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge upstream PR icereed#917: sanitize LLM API payloads

6dfdb54

Merge upstream PR icereed#918: run auto-tagging only when OCR queue is empty

d7f8584

Merge upstream PR icereed#925: render OCR prompt per document with existing content

74ac656

feat(ocr): chain classification and support ollama think toggles

7a41eb9

ivanzud merged commit b3928fe into main Mar 8, 2026
0 of 12 checks passed

ivanzud deleted the integrate/upstream-wave-20260308b branch March 8, 2026 17:19

This was referenced Mar 8, 2026

[UPSTREAM #916] Feature Request: Content Sanitization for LLM API Calls #132

Closed

[UPSTREAM #931] Feature Request: Implement parameter to disable thinking #138

Closed

ivanzud merged 16 commits into main from integrate/upstream-wave-20260308b


4 participants
