Resolve unresolved upstream March OCR/Ollama items #131
Merged
Introduce sanitize package to strip configured patterns from content before sending to LLMs. Supports literal string removal via REMOVE_FROM_CONTENT env var and regex pattern removal via REMOVE_FROM_CONTENT_REGEX env var. Apply sanitization to document content processing and OCR prompts to prevent sensitive information from being sent to external LLM APIs.
Content Sanitization Feature
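The sanitization step described above could look roughly like the following. This is a minimal sketch, not the actual sanitize package: the `Sanitizer` type, `NewSanitizer`, and `Clean` are hypothetical names, and the way `REMOVE_FROM_CONTENT` and `REMOVE_FROM_CONTENT_REGEX` are split into individual entries is not shown in this PR, so the sketch takes already-split slices as input.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Sanitizer is a hypothetical type holding the literal strings and the
// compiled regular expressions that should be stripped from content.
type Sanitizer struct {
	literals []string
	patterns []*regexp.Regexp
}

// NewSanitizer compiles all regex patterns up front, so an invalid pattern
// is reported once at construction time rather than on every document.
func NewSanitizer(literals, regexes []string) (*Sanitizer, error) {
	s := &Sanitizer{}
	for _, lit := range literals {
		if lit != "" { // ignore empty entries
			s.literals = append(s.literals, lit)
		}
	}
	for _, expr := range regexes {
		if expr == "" {
			continue
		}
		re, err := regexp.Compile(expr)
		if err != nil {
			return nil, fmt.Errorf("invalid sanitize pattern %q: %w", expr, err)
		}
		s.patterns = append(s.patterns, re)
	}
	return s, nil
}

// Clean removes every literal first, then every regex match.
func (s *Sanitizer) Clean(content string) string {
	for _, lit := range s.literals {
		content = strings.ReplaceAll(content, lit, "")
	}
	for _, re := range s.patterns {
		content = re.ReplaceAllString(content, "")
	}
	return content
}

func main() {
	s, err := NewSanitizer(
		[]string{"ACME Corp"},
		[]string{`\b\d{3}-\d{2}-\d{4}\b`},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Clean("Invoice from ACME Corp, SSN 123-45-6789."))
}
```

Compiling the patterns once and reusing them keeps per-document cost low; `*regexp.Regexp` is safe for concurrent use, so one sanitizer can be shared across workers.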
This helps avoid the task-switching overhead caused by potentially long model startup times.
Red phase: tests for SetPrompt/GetPrompt on LLMProvider and renderOCRPrompt function that renders OCR templates per-document with existing document content. Ref: icereed#882
Moves OCR template rendering from startup-only to per-document,
following the same pattern used by title, tag, correspondent, and
all other prompt templates. Enables the OCR prompt to reference the
document's existing text via {{.Content}}.
Changes:
- Add renderOCRPrompt() for per-document template rendering
- Add SetPrompt/GetPrompt on LLMProvider for per-document overrides
- Add ExistingContent to OCROptions, passed from background processor
- Update default ocr_prompt.tmpl with conditional {{.Content}} block
- Startup rendering passes Content="" as fallback (backward compatible)
Templates without {{.Content}} work identically to before. Users with
custom prompts opt in by adding {{if .Content}}...{{end}} to their
template.
Ref: icereed#882
- Replace SetPrompt with WithPrompt to avoid mutating the shared ocrProvider singleton (concurrency safety)
- Cap existing content to 8000 chars before injecting into OCR prompt to avoid blowing vision model context on long documents
- Tighten language test assertion with t.Setenv and assert.Equal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
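The two safety changes in this commit can be sketched as follows. This is an assumption-laden illustration: the fields of `LLMProvider`, the helper name `capContent`, and the choice to count runes rather than bytes for the 8000-character limit are hypothetical; the commit message states only that WithPrompt avoids mutating the shared provider and that content is capped at 8000 chars.

```go
package main

import "fmt"

// maxExistingContentChars mirrors the 8000-char cap from the commit message.
const maxExistingContentChars = 8000

// LLMProvider is a stand-in with hypothetical fields; the real provider
// carries more state (client, model options, and so on).
type LLMProvider struct {
	model  string
	prompt string
}

// GetPrompt returns the prompt this provider instance will use.
func (p *LLMProvider) GetPrompt() string { return p.prompt }

// WithPrompt returns a shallow copy with the prompt replaced. The receiver
// is never modified, so a shared singleton stays safe under concurrent use.
func (p *LLMProvider) WithPrompt(prompt string) *LLMProvider {
	clone := *p
	clone.prompt = prompt
	return &clone
}

// capContent truncates s to at most max runes (assumption: "chars" means
// runes here), so multi-byte characters are never split.
func capContent(s string, max int) string {
	runes := []rune(s)
	if len(runes) <= max {
		return s
	}
	return string(runes[:max])
}

func main() {
	shared := &LLMProvider{model: "example-vision-model", prompt: "default prompt"}

	existing := capContent("existing document text", maxExistingContentChars)
	perDoc := shared.WithPrompt("prompt including: " + existing)

	fmt.Println(shared.GetPrompt()) // the shared instance is unchanged
	fmt.Println(perDoc.GetPrompt())
}
```

A copy-returning WithPrompt trades a small allocation per document for freedom from locking, which fits a background processor that may handle several documents at once.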
…m debug log
- Remove empty-string guard so intentionally-empty templates take effect
- Log prompt length instead of full content in debug output
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Mar 8, 2026
This batch handles only upstream items that were still unresolved on ivanzud/paperless-gpt after PR #130.

Merged upstream work:

Local follow-up on top of upstream:
- #926 OCR-then-classify chaining path (AUTO_OCR_THEN_CLASSIFY=true)
- OLLAMA_THINK and VISION_OLLAMA_THINK to explicitly enable/disable Ollama thinking mode (#931)

Validation:
- go build .
- go test ./sanitize
- go test ./ocr -run "^(TestOllamaThinkingCallOptions_Disabled|TestLLMProvider_WithPrompt|TestRenderOCRPrompt|TestResolveVisionOllamaHost)$"

Known unrelated baseline issues in this checkout:
- app_llm_test.go/paperless_test.go
- go test ./ocr still fails on the existing OpenAI-compatible no-key test
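
For the OLLAMA_THINK and VISION_OLLAMA_THINK toggles listed under the local follow-up, "explicitly enable/disable" suggests three states: unset (leave the model default), true, and false. A minimal sketch of that parsing, assuming the values are standard boolean strings; the helper name `parseOptionalBool` is hypothetical, and how the PR's code actually reads these variables is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseOptionalBool returns nil for an empty value (meaning "not configured,
// keep the default") and a pointer to the parsed bool otherwise.
func parseOptionalBool(raw string) (*bool, error) {
	if raw == "" {
		return nil, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean %q: %w", raw, err)
	}
	return &v, nil
}

func main() {
	think, err := parseOptionalBool(os.Getenv("OLLAMA_THINK"))
	if err != nil {
		panic(err)
	}
	switch {
	case think == nil:
		fmt.Println("thinking mode: model default")
	case *think:
		fmt.Println("thinking mode: explicitly enabled")
	default:
		fmt.Println("thinking mode: explicitly disabled")
	}
}
```

Using a `*bool` rather than a plain `bool` keeps "unset" distinguishable from "false", so an option is only sent to the model when the user actually configured it.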