fix: break auto-tag loop on update rejection; salvage valid fields by thu1971dlr · Pull Request #976 · icereed/paperless-gpt

thu1971dlr · 2026-05-19T15:08:06Z

Problem

When paperless-ngx rejects an UpdateDocuments PATCH (HTTP 400 from its serializer), paperless-gpt does not remove the auto tag from the document. The next polling cycle picks the document up again, runs the full LLM pipeline, sends another PATCH, gets rejected again — indefinitely.

Reproduction: a document whose OCR text contains a corrupted date (Datum 79.01.2023, i.e. "Date 79.01.2023") leads the LLM to extract 2023-01-79 into the Fälligkeitsdatum (due date) custom field. paperless-ngx rejects the PATCH with:

400 {"custom_fields":[{},{},{},{},{"non_field_errors":
    ["Date has wrong format. Use one of these formats instead:
    YYYY-MM-DD."]},{},{}]}

On a paid LLM provider the loop bills ~6 calls every ~9 seconds until the user notices and manually removes the tag.

Fix

Two layers:

Strip-and-retry. Parse the 400 body, identify which fields paperless-ngx rejected (top-level scalars + array indices into custom_fields), remove only those, and retry the PATCH. The valid title / correspondent / tags / document_type / surviving custom fields all land. Bounded to 3 retries; in practice 1 is enough because paperless-ngx reports all errors in one body.
Fail-tag marker. Whenever fields had to be dropped, apply a configurable fail tag (FAIL_TAG, default paperless-gpt-failed) so the user can find the document and complete it manually. The tag is auto-created in paperless-ngx at startup so the user doesn't need any setup.
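
The parsing step can be sketched roughly as follows. This is a minimal illustration based only on the 400 body shown in the Problem section, not the PR's actual parsePaperlessValidationErrors; the function name parseValidationErrors, its return shape, and the "created" field in the usage are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// parseValidationErrors is a hypothetical stand-in for the PR's parser.
// It returns the rejected top-level field names, the rejected indices into
// custom_fields, and ok=false when the body cannot be used to salvage the update.
func parseValidationErrors(body []byte) (scalars []string, cfIdx []int, ok bool) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(body, &top); err != nil || len(top) == 0 {
		return nil, nil, false // unparseable or empty response
	}
	for field, raw := range top {
		switch field {
		case "tags":
			// Tag errors are treated as unrecoverable: tags are not safe to drop.
			return nil, nil, false
		case "custom_fields":
			var entries []json.RawMessage
			if err := json.Unmarshal(raw, &entries); err != nil {
				return nil, nil, false
			}
			for i, entry := range entries {
				var obj map[string]json.RawMessage
				if err := json.Unmarshal(entry, &obj); err != nil {
					return nil, nil, false
				}
				if len(obj) > 0 { // "{}" means this entry was accepted
					cfIdx = append(cfIdx, i)
				}
			}
		default:
			scalars = append(scalars, field)
		}
	}
	sort.Strings(scalars) // map iteration order is random; keep output stable
	return scalars, cfIdx, len(scalars)+len(cfIdx) > 0
}

func main() {
	body := []byte(`{"custom_fields":[{},{},{},{},{"non_field_errors":["Date has wrong format. Use one of these formats instead: YYYY-MM-DD."]},{},{}]}`)
	scalars, cfIdx, ok := parseValidationErrors(body)
	fmt.Println(scalars, cfIdx, ok)
}
```

Because the server reports one entry per submitted custom field, the position of each non-empty object is enough to know which entry to drop from the payload.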

If the update can't be salvaged at all (response unparseable, errors reference fields we can't safely drop such as tags, retry cap exhausted), the failure path is taken: remove the auto tag and add the fail tag in a tag-only PATCH. The loop is broken either way.
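
As a rough illustration of the tag-only recovery PATCH, the new tag set could be computed like this. recoveryTags is a hypothetical helper written for this example (the PR's recoverFromFailedUpdate is not shown here), and it works on tag names rather than paperless-ngx tag IDs for simplicity.

```go
package main

import "fmt"

// recoveryTags is a hypothetical helper: it drops autoTag from the document's
// current tags and appends failTag unless failTag is empty or already present.
func recoveryTags(current []string, autoTag, failTag string) []string {
	out := make([]string, 0, len(current)+1)
	hasFail := false
	for _, t := range current {
		if t == autoTag {
			continue // removing the trigger tag is what breaks the polling loop
		}
		if t == failTag {
			hasFail = true
		}
		out = append(out, t)
	}
	if failTag != "" && !hasFail {
		out = append(out, failTag)
	}
	return out
}

func main() {
	tags := recoveryTags([]string{"invoice", "paperless-gpt-auto"}, "paperless-gpt-auto", "paperless-gpt-failed")
	fmt.Println(tags)
}
```

With an empty fail tag the helper only removes the auto tag, which is still sufficient to stop the document from being picked up again.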

The same handling applies to both processAutoTagDocuments and processAutoOcrTagDocuments.

Test plan

Unit tests for parsePaperlessValidationErrors covering real-world body, scalar-only, custom_field-only, tags-only (unrecoverable), garbage, empty response.
Unit tests for stripFailedFields covering scalar drop, indexed custom_field drop, multi-index drop, full removal, out-of-range indices, no-op on absent fields.
Unit tests for recoverFromFailedUpdate covering happy path, empty FAIL_TAG, recovery itself failing, OCR variant.
End-to-end against a real paperless-ngx instance: doc with OCR-corrupted date triggers 400 → strip-and-retry → all valid fields land → fail tag applied → no further polling cycles.

Backwards compatibility

FAIL_TAG defaults to paperless-gpt-failed and is auto-created at startup; existing deployments need no configuration changes. The new behaviour replaces a code path that previously caused an unbounded processing loop — strictly an improvement.

Notes

The strip-and-retry approach is deliberately data-driven from paperless-ngx's own response. paperless-gpt does not maintain a parallel copy of paperless-ngx's validation rules, so adding new field types upstream doesn't require corresponding changes here.
The current Dockerfile pins musl-dev=1.2.5-r9 which is no longer available on Alpine 3.21 (current is r11); building from this branch as-is therefore fails. That is a separate, unrelated bit-rot and should be a Renovate bump — this PR does not touch the Dockerfile.

Summary by CodeRabbit

New Features
- Added FAIL_TAG environment variable (default: paperless-gpt-failed) to mark documents when updates are rejected
- App now ensures the fail tag exists at startup
Bug Fixes
- Better recovery to avoid repeated re-processing of failed documents
- Partial-update handling improved with targeted retries that omit invalid fields
Documentation
- README updated with FAIL_TAG usage and failure scenarios
Tests
- Added tests covering recovery and partial-update behaviors

When paperless-ngx rejected an UpdateDocuments PATCH with a 400 (e.g. an LLM-suggested value that fails server-side validation, such as a malformed date like "2023-01-79"), paperless-gpt would:

- Log the error
- NOT remove the auto tag
- Return, leaving the next poll cycle to re-run the full LLM pipeline against the same document, indefinitely.

For documents tagged with paperless-gpt-auto and processed via a paid LLM, this billed fresh LLM calls (~6 per cycle, every ~9s) until the tag was manually removed.

This change introduces:

1. FAIL_TAG env var (default: paperless-gpt-failed). Auto-created in paperless-ngx at startup so the user doesn't have to.
2. Strip-and-retry in UpdateDocuments: on a 400, parse the validation response, identify the rejected fields/custom_field entries, drop them, and retry. The valid fields land instead of being discarded with the bad ones. Bounded to 3 retries.
3. PartialUpdateError sentinel so the caller can apply FAIL_TAG to a document whose update partially succeeded.
4. recoverFromFailedUpdate helper: when an update cannot be salvaged (unparseable response, tag-related errors, retries exhausted), explicitly remove the auto tag and add FAIL_TAG in a tag-only PATCH.

The same handling applies to the OCR auto-tag pipeline.

Adds unit tests for the validation-error parser, the strip helper, and the loop-break recovery path.

coderabbitai · 2026-05-19T15:08:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59b23cd0-4b89-455a-bdd6-a198692e8398

📥 Commits

Reviewing files that changed from the base of the PR and between 21297e6 and 0292d1f.

📒 Files selected for processing (2)

paperless.go
paperless_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

paperless.go

📝 Walkthrough

Walkthrough

Implements PATCH retry that strips rejected fields on Paperless-NGX validation 400s, returns PartialUpdateError for partial successes, adds background recovery to remove auto-tags and optionally apply a fail tag, ensures fail-tag exists at startup, and documents FAIL_TAG.

Changes

Partial Update Recovery Flow

Layer / File(s)	Summary
PartialUpdateError type definition `types.go`	New exported PartialUpdateError type carries DocumentID and DroppedFields and implements the error interface; `fmt` import added.
UpdateDocuments PATCH retry with validation error parsing and field stripping `paperless.go`, `paperless_test.go`	UpdateDocuments now retries on HTTP 400 by parsing validation responses to identify rejected scalar fields and `custom_fields` indices, strips only those fields, retries up to a cap, and records the first partial-success as PartialUpdateError. Includes created_date pre-validation and tests for parsing/stripping and created_date behavior.
Tag existence verification utility `paperless.go`	New EnsureTagExists method checks for tag existence in Paperless-NGX and creates the tag if missing; no-op for empty tag names and surfaces errors.
Background recovery handlers and partial-success processing `background.go`, `background_test.go`	Adds applyFailTagAfterPartialSuccess and recoverFromFailedUpdate to best-effort apply/remove tags after partial or hard failures (without clobbering suggested tags); imports `gorm.io/gorm`. Tests validate recovery UpdateDocuments calls and non-panicking behavior.
Auto-tag processing error detection and recovery integration `background.go`	processAutoTagDocuments and processAutoOcrTagDocuments now detect PartialUpdateError via errors.As; on partial success apply failTag and count as processed; on non-partial failures call recoverFromFailedUpdate to remove triggering auto-tag and optionally apply failTag.
Startup configuration and tag creation `main.go`	Adds `failTag` env var (FAIL_TAG) defaulting to `paperless-gpt-failed`; startup calls EnsureTagExists and logs a warning on failure; validateOrDefaultEnvVars sets and prints resolved failTag.
Environment variable documentation `README.md`	FAIL_TAG documented in Docker Compose example environment block and Environment Variables table with semantics for partial rejections vs hard failures and auto-creation behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Processor as processAutoTagDocuments
  participant Paperless as PaperlessClient.UpdateDocuments
  participant Parser as parsePaperlessValidationErrors
  participant Stripper as stripFailedFields
  participant Recovery as applyFailTagAfterPartialSuccess / recoverFromFailedUpdate

  Processor->>Paperless: PATCH document
  Paperless-->>Processor: HTTP 400 (validation errors) / Error
  alt HTTP 400
    Processor->>Parser: parse 400 body
    Parser-->>Processor: failed scalar fields + custom_fields indices
    Processor->>Stripper: remove failed fields from payload
    Stripper-->>Processor: stripped payload
    Processor->>Paperless: retry PATCH
    Paperless-->>Processor: Success (maybe partial)
    Processor->>Recovery: applyFailTagAfterPartialSuccess (if partial)
  else Hard Error
    Processor->>Recovery: recoverFromFailedUpdate (remove auto-tag, maybe add failTag)
    Recovery->>Paperless: PATCH tags
    Paperless-->>Recovery: OK / best-effort
  end
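
The retry branch of the diagram could look roughly like the loop below. It is a simplified sketch, not the code in paperless.go: patchFunc stands in for the HTTP PATCH plus response parsing, field names are plain map keys, and updateWithStrip is a name invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

const maxRetries = 3 // the PR bounds strip-and-retry to 3 retries

// patchFunc is a hypothetical stand-in for the PATCH call: it returns the
// names of the fields the server rejected (empty on success).
type patchFunc func(fields map[string]any) ([]string, error)

// updateWithStrip sends the payload, removes whatever the server rejected,
// and retries until the payload is accepted or the cap is reached.
// It returns the fields that had to be dropped along the way.
func updateWithStrip(fields map[string]any, patch patchFunc) ([]string, error) {
	var dropped []string
	for attempt := 0; attempt <= maxRetries; attempt++ {
		rejected, err := patch(fields)
		if err != nil {
			return dropped, err
		}
		if len(rejected) == 0 {
			return dropped, nil // accepted; dropped is non-empty on partial success
		}
		removed := 0
		for _, f := range rejected {
			if _, present := fields[f]; present {
				delete(fields, f)
				dropped = append(dropped, f)
				removed++
			}
		}
		if removed == 0 {
			// Guard against looping when the errors match nothing we sent.
			return dropped, errors.New("server rejected fields not present in payload")
		}
	}
	return dropped, errors.New("retry cap exhausted")
}

func main() {
	fakePatch := func(fields map[string]any) ([]string, error) {
		if _, bad := fields["due_date"]; bad {
			return []string{"due_date"}, nil
		}
		return nil, nil
	}
	fields := map[string]any{"title": "Invoice", "due_date": "2023-01-79"}
	dropped, err := updateWithStrip(fields, fakePatch)
	fmt.Println(dropped, err, fields)
}
```

The guard for "nothing removed" matters as much as the retry cap: without it, a response format the parser does not understand would burn all retries on an unchanged payload.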

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Bug: AUTO_TAG auto-tagging loops indefinitely on any Paperless PATCH 4xx — LLM-side equivalent of #949 #975: The PR directly implements the suggested fix by adding FAIL_TAG behavior, partial-update handling, and recovery integration to avoid re-queuing loops.

Possibly related PRs

icereed/paperless-gpt#119: Related changes to UpdateDocuments tag-handling logic to preserve/merge existing tags alongside suggested tags.
icereed/paperless-gpt#321: Background processing refactor touching the same auto-tag flows; this PR layers partial-update detection and recovery on those paths.
icereed/paperless-gpt#319: Prior work on auto-tag processing and failure handling; conceptually related to re-queuing prevention.

Suggested labels

safe-to-test

Poem

🐰 I hopped through retries in the night,
Stripped the bad bits, retried just right.
When patches fail and loops ensue,
A fail-tag lands to end the queue.
Hooray for tidy, brave little bugs 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly summarizes the main changes: breaking the auto-tag reprocessing loop by handling update rejections and preserving valid fields through strip-and-retry logic.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@paperless.go`:
- Around line 728-736: When you call stripFailedFields and remove rejected keys
from updatedFields (newlyDropped), also remove those same keys from
originalFields so the post-retry modification-history loop doesn't think dropped
fields were changed to nil; specifically, after computing newlyDropped and
appending to partialDroppedFields, iterate newlyDropped and delete each key from
originalFields (or keep originalFields aligned with updatedFields) so later code
that reads originalFields and writes updatedFields[field] records only real
changes.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 98c0c319-d148-47c8-83e8-443220105ce8

📥 Commits

Reviewing files that changed from the base of the PR and between 94c8428 and 21297e6.

📒 Files selected for processing (7)

README.md
background.go
background_test.go
main.go
paperless.go
paperless_test.go
types.go

coderabbitai · 2026-05-19T15:12:25Z

+			newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails)
+			if len(newlyDropped) == 0 {
+				// Paperless reported errors but they don't match anything in our
+				// current payload — defensive guard against parser/format drift.
+				return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes))
+			}
+
+			partialDroppedFields = append(partialDroppedFields, newlyDropped...)
+			log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep originalFields synchronized when retry stripping drops fields.

At Line 728, rejected fields are removed from updatedFields, but originalFields is left unchanged. Later (Line 812+), modification history iterates originalFields and writes updatedFields[field], so dropped fields are recorded as changed to <nil> even though they were never applied.

Suggested fix

 newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails)
 if len(newlyDropped) == 0 {
 	// Paperless reported errors but they don't match anything in our
 	// current payload — defensive guard against parser/format drift.
 	return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes))
 }

 partialDroppedFields = append(partialDroppedFields, newlyDropped...)
+// Keep modification-history source maps aligned with fields
+// that are still actually being patched.
+for field := range scalarFails {
+	delete(originalFields, field)
+}
+if _, stillPresent := updatedFields["custom_fields"]; !stillPresent {
+	delete(originalFields, "custom_fields")
+}
 log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes))


The previous regex `^\d{4}-\d{2}-\d{2}$` only validates the string format, not whether the digits form a real calendar date. It accepted impossible values like "2023-01-79" (day 79 does not exist), which were passed through to paperless-ngx and rejected with a 400.

Replacing the regex with time.Parse("2006-01-02", ...) catches these values before the PATCH is sent. The field is dropped and added to partialDroppedFields so the caller applies the fail tag. The user-visible outcome is the same as the strip-and-retry path for post-PATCH rejections, but with one fewer HTTP round-trip.

Adds a test verifying that UpdateDocuments returns a PartialUpdateError with created_date in DroppedFields when given an impossible date, and that the PATCH payload does not include the invalid field.
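
The difference between the two checks is easy to demonstrate in isolation. The snippet below is a standalone sketch; isoShape and isRealDate are names chosen for this example and do not appear in the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// isoShape is the shape-only check: four digits, two digits, two digits.
var isoShape = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

// isRealDate reports whether s is a valid calendar date in YYYY-MM-DD form.
// time.Parse rejects out-of-range months and days.
func isRealDate(s string) bool {
	_, err := time.Parse("2006-01-02", s)
	return err == nil
}

func main() {
	for _, s := range []string{"2023-01-19", "2023-01-79", "2023-02-30"} {
		fmt.Printf("%s shape=%v real=%v\n", s, isoShape.MatchString(s), isRealDate(s))
	}
}
```

All three inputs match the regex, but only the first one parses as a date, which is why the shape check alone let the corrupted value through to the server.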

coderabbitai Bot reviewed May 19, 2026


thu1971dlr mentioned this pull request May 19, 2026

Bug: AUTO_TAG auto-tagging loops indefinitely on any Paperless PATCH 4xx — LLM-side equivalent of #949 #975

Open

thu1971dlr wants to merge 2 commits into icereed:main from thu1971dlr:fix-break-auto-tag-loop-on-rejection
