fix: break auto-tag loop on update rejection; salvage valid fields #976

Open
thu1971dlr wants to merge 2 commits into icereed:main from thu1971dlr:fix-break-auto-tag-loop-on-rejection

Conversation

thu1971dlr commented May 19, 2026

Problem

When paperless-ngx rejects an UpdateDocuments PATCH (HTTP 400 from its serializer), paperless-gpt does not remove the auto tag from the document. The next polling cycle picks the document up again, runs the full LLM pipeline, sends another PATCH, gets rejected again — indefinitely.

Reproduction: a document whose OCR text contains a corrupted date (Datum 79.01.2023) leads the LLM to extract 2023-01-79 into the Fälligkeitsdatum custom field. paperless-ngx rejects with:

400 {"custom_fields":[{},{},{},{},{"non_field_errors":
    ["Date has wrong format. Use one of these formats instead:
    YYYY-MM-DD."]},{},{}]}

On a paid LLM provider the loop bills ~6 calls every ~9 seconds until the user notices and manually removes the tag.

Fix

Two layers:

  1. Strip-and-retry. Parse the 400 body, identify which fields paperless-ngx rejected (top-level scalars + array indices into custom_fields), remove only those, and retry the PATCH. The valid title / correspondent / tags / document_type / surviving custom fields all land. Bounded to 3 retries; in practice 1 is enough because paperless-ngx reports all errors in one body.

  2. Fail-tag marker. Whenever fields had to be dropped, apply a configurable fail tag (FAIL_TAG, default paperless-gpt-failed) so the user can find the document and complete it manually. The tag is auto-created in paperless-ngx at startup so the user doesn't need any setup.
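
The parsing step of layer 1 might look roughly like the sketch below. It is not the PR's `parsePaperlessValidationErrors`; `parseRejected` is a hypothetical name, and the sketch assumes the 400 body is a JSON object whose `custom_fields` value is a list with one entry per submitted custom field, where an empty object means "no error" (as in the example response above).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// parseRejected is a hypothetical helper (not the PR's parser). It reads a
// 400 body and returns the rejected top-level keys plus the rejected indices
// within custom_fields. ok is false when the body cannot be interpreted.
func parseRejected(body []byte) (scalars []string, cfIdx []int, ok bool) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(body, &raw); err != nil {
		return nil, nil, false
	}
	for key, val := range raw {
		if key != "custom_fields" {
			scalars = append(scalars, key)
			continue
		}
		// One entry per submitted custom field; an empty object means no error.
		var entries []map[string]json.RawMessage
		if err := json.Unmarshal(val, &entries); err != nil {
			return nil, nil, false
		}
		for i, e := range entries {
			if len(e) > 0 {
				cfIdx = append(cfIdx, i)
			}
		}
	}
	sort.Strings(scalars) // map iteration order is random; sort for stable output
	return scalars, cfIdx, true
}

func main() {
	body := []byte(`{"title":["This field may not be blank."],"custom_fields":[{},{},{},{},{"non_field_errors":["Date has wrong format."]},{},{}]}`)
	scalars, cfIdx, ok := parseRejected(body)
	fmt.Println(scalars, cfIdx, ok) // [title] [4] true
}
```

The sketch does not decide which keys are safe to drop. A caller would still have to treat a key such as `tags` as unrecoverable, as described in the next paragraph.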

If the update can't be salvaged at all (response unparseable, errors reference fields we can't safely drop such as tags, retry cap exhausted), the failure path is taken: remove the auto tag and add the fail tag in a tag-only PATCH. The loop is broken either way.
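
As an illustration of the tag-only recovery PATCH, the tag list it sends could be computed as in this sketch. `recoveryTags` is a hypothetical helper, it works on tag names for readability (the real client may work with tag IDs), and skipping the fail tag when it is empty is an assumption.

```go
package main

import "fmt"

// recoveryTags is a hypothetical helper: it returns the document's tags with
// autoTag removed and failTag appended, unless failTag is empty or already
// present. The input slice is not modified.
func recoveryTags(current []string, autoTag, failTag string) []string {
	out := make([]string, 0, len(current)+1)
	hasFail := false
	for _, t := range current {
		if t == autoTag {
			continue // removing the trigger tag is what breaks the loop
		}
		if t == failTag {
			hasFail = true
		}
		out = append(out, t)
	}
	if failTag != "" && !hasFail {
		out = append(out, failTag)
	}
	return out
}

func main() {
	tags := recoveryTags(
		[]string{"invoice", "paperless-gpt-auto"},
		"paperless-gpt-auto", "paperless-gpt-failed",
	)
	fmt.Println(tags) // [invoice paperless-gpt-failed]
}
```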

The same handling applies to both processAutoTagDocuments and processAutoOcrTagDocuments.

Test plan

  • Unit tests for parsePaperlessValidationErrors covering real-world body, scalar-only, custom_field-only, tags-only (unrecoverable), garbage, empty response.
  • Unit tests for stripFailedFields covering scalar drop, indexed custom_field drop, multi-index drop, full removal, out-of-range indices, no-op on absent fields.
  • Unit tests for recoverFromFailedUpdate covering happy path, empty FAIL_TAG, recovery itself failing, OCR variant.
  • End-to-end against a real paperless-ngx instance: doc with OCR-corrupted date triggers 400 → strip-and-retry → all valid fields land → fail tag applied → no further polling cycles.
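
To make the cases listed for the strip helper concrete, here is a minimal sketch of such a helper. `stripRejected` is a hypothetical stand-in, not the PR's `stripFailedFields`; the payload shape (`map[string]any` with `custom_fields` as a `[]any`) and the format of the returned names are assumptions.

```go
package main

import "fmt"

// stripRejected is a hypothetical stand-in for the PR's strip helper. It
// deletes rejected top-level keys from payload, removes rejected indices from
// payload["custom_fields"], and returns a description of what was dropped.
// Absent keys and out-of-range indices are ignored.
func stripRejected(payload map[string]any, scalars []string, cfIdx []int) []string {
	var dropped []string
	for _, k := range scalars {
		if _, ok := payload[k]; ok {
			delete(payload, k)
			dropped = append(dropped, k)
		}
	}
	cfs, ok := payload["custom_fields"].([]any)
	if !ok {
		return dropped
	}
	reject := make(map[int]bool, len(cfIdx))
	for _, i := range cfIdx {
		if i >= 0 && i < len(cfs) {
			reject[i] = true
		}
	}
	if len(reject) == 0 {
		return dropped
	}
	kept := make([]any, 0, len(cfs))
	for i, cf := range cfs {
		if reject[i] {
			dropped = append(dropped, fmt.Sprintf("custom_fields[%d]", i))
			continue
		}
		kept = append(kept, cf)
	}
	if len(kept) == 0 {
		delete(payload, "custom_fields") // nothing left to send
	} else {
		payload["custom_fields"] = kept
	}
	return dropped
}

func main() {
	payload := map[string]any{
		"title":         "Invoice",
		"custom_fields": []any{"2023-01-15", "2023-01-79"},
	}
	// Index 9 is out of range and is silently ignored.
	fmt.Println(stripRejected(payload, nil, []int{1, 9}), payload)
}
```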

Backwards compatibility

FAIL_TAG defaults to paperless-gpt-failed and is auto-created at startup; existing deployments need no configuration changes. The new behaviour replaces a code path that previously caused an unbounded processing loop — strictly an improvement.

Notes

  • The strip-and-retry approach is deliberately data-driven from paperless-ngx's own response. paperless-gpt does not maintain a parallel copy of paperless-ngx's validation rules, so adding new field types upstream doesn't require corresponding changes here.
  • The current Dockerfile pins musl-dev=1.2.5-r9 which is no longer available on Alpine 3.21 (current is r11); building from this branch as-is therefore fails. That is a separate, unrelated bit-rot and should be a Renovate bump — this PR does not touch the Dockerfile.

Summary by CodeRabbit

  • New Features

    • Added FAIL_TAG environment variable (default: paperless-gpt-failed) to mark documents when updates are rejected
    • App now ensures the fail tag exists at startup
  • Bug Fixes

    • Better recovery to avoid repeated re-processing of failed documents
    • Partial-update handling improved with targeted retries that omit invalid fields
  • Documentation

    • README updated with FAIL_TAG usage and failure scenarios
  • Tests

    • Added tests covering recovery and partial-update behaviors


When paperless-ngx rejected an UpdateDocuments PATCH with a 400
(e.g. an LLM-suggested value that fails server-side validation,
such as a malformed date like "2023-01-79"), paperless-gpt would:
  - Log the error
  - NOT remove the auto tag
  - Return, leaving the next poll cycle to re-run the full LLM
    pipeline against the same document — indefinitely.

For documents tagged with paperless-gpt-auto and processed via a
paid LLM, this billed fresh LLM calls (~6 per cycle, every ~9s)
until the tag was manually removed.

This change introduces:

1. FAIL_TAG env var (default: paperless-gpt-failed). Auto-created
   in paperless-ngx at startup so the user doesn't have to.

2. Strip-and-retry in UpdateDocuments: on a 400, parse the
   validation response, identify the rejected fields/custom_field
   entries, drop them, and retry. The valid fields land instead
   of being discarded with the bad ones. Bounded to 3 retries.

3. PartialUpdateError sentinel so the caller can apply FAIL_TAG
   to a document whose update partially succeeded.

4. recoverFromFailedUpdate helper: when an update cannot be
   salvaged (unparseable response, tag-related errors, retries
   exhausted), explicitly remove the auto tag and add FAIL_TAG
   in a tag-only PATCH. The same handling applies to the OCR
   auto-tag pipeline.

Adds unit tests for the validation-error parser, the strip
helper, and the loop-break recovery path.
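
The bounded strip-and-retry loop from item 2 can be sketched without any HTTP code. Everything here is illustrative: `patchFn` and `updateWithRetry` are hypothetical names, the fake server stands in for paperless-ngx, and the payload is simplified to a flat string map.

```go
package main

import (
	"errors"
	"fmt"
)

const maxRetries = 3 // matches the bound stated in the commit message

// patchFn is a hypothetical stand-in for the HTTP PATCH: it returns the keys
// the server rejected, or nil when the payload was accepted.
type patchFn func(payload map[string]string) (rejected []string)

// updateWithRetry sends payload, dropping rejected keys between attempts.
// It returns every dropped key, or an error if the update cannot be salvaged.
func updateWithRetry(patch patchFn, payload map[string]string) ([]string, error) {
	var dropped []string
	for attempt := 0; attempt <= maxRetries; attempt++ {
		rejected := patch(payload)
		if len(rejected) == 0 {
			return dropped, nil // accepted, possibly after dropping fields
		}
		removed := 0
		for _, k := range rejected {
			if _, ok := payload[k]; ok {
				delete(payload, k)
				dropped = append(dropped, k)
				removed++
			}
		}
		if removed == 0 {
			// Retrying an unchanged payload would only repeat the rejection.
			return dropped, errors.New("server rejected fields not present in payload")
		}
	}
	return dropped, errors.New("retry cap exhausted")
}

func main() {
	// Fake server: rejects any payload that still carries the bad due date.
	server := func(p map[string]string) []string {
		if p["due_date"] == "2023-01-79" {
			return []string{"due_date"}
		}
		return nil
	}
	payload := map[string]string{"title": "Invoice", "due_date": "2023-01-79"}
	dropped, err := updateWithRetry(server, payload)
	fmt.Println(dropped, err, len(payload)) // [due_date] <nil> 1
}
```

A non-empty `dropped` together with a nil error corresponds to the partial-success case; in the PR that case is reported to the caller through `PartialUpdateError`.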

coderabbitai Bot commented May 19, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59b23cd0-4b89-455a-bdd6-a198692e8398

📥 Commits

Reviewing files that changed from the base of the PR and between 21297e6 and 0292d1f.

📒 Files selected for processing (2)
  • paperless.go
  • paperless_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • paperless.go

📝 Walkthrough

Walkthrough

Implements PATCH retry that strips rejected fields on Paperless-NGX validation 400s, returns PartialUpdateError for partial successes, adds background recovery to remove auto-tags and optionally apply a fail tag, ensures fail-tag exists at startup, and documents FAIL_TAG.

Changes

Partial Update Recovery Flow

| Layer | File(s) | Summary |
|---|---|---|
| PartialUpdateError type definition | types.go | New exported PartialUpdateError type carries DocumentID and DroppedFields and implements the error interface; fmt import added. |
| UpdateDocuments PATCH retry with validation error parsing and field stripping | paperless.go, paperless_test.go | UpdateDocuments now retries on HTTP 400 by parsing validation responses to identify rejected scalar fields and custom_fields indices, strips only those fields, retries up to a cap, and records the first partial-success as PartialUpdateError. Includes created_date pre-validation and tests for parsing/stripping and created_date behavior. |
| Tag existence verification utility | paperless.go | New EnsureTagExists method checks for tag existence in Paperless-NGX and creates the tag if missing; no-op for empty tag names and surfaces errors. |
| Background recovery handlers and partial-success processing | background.go, background_test.go | Adds applyFailTagAfterPartialSuccess and recoverFromFailedUpdate to best-effort apply/remove tags after partial or hard failures (without clobbering suggested tags); imports gorm.io/gorm. Tests validate recovery UpdateDocuments calls and non-panicking behavior. |
| Auto-tag processing error detection and recovery integration | background.go | processAutoTagDocuments and processAutoOcrTagDocuments now detect PartialUpdateError via errors.As; on partial success apply failTag and count as processed; on non-partial failures call recoverFromFailedUpdate to remove triggering auto-tag and optionally apply failTag. |
| Startup configuration and tag creation | main.go | Adds failTag env var (FAIL_TAG) defaulting to paperless-gpt-failed; startup calls EnsureTagExists and logs a warning on failure; validateOrDefaultEnvVars sets and prints resolved failTag. |
| Environment variable documentation | README.md | FAIL_TAG documented in Docker Compose example environment block and Environment Variables table with semantics for partial rejections vs hard failures and auto-creation behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant Processor as processAutoTagDocuments
  participant Paperless as PaperlessClient.UpdateDocuments
  participant Parser as parsePaperlessValidationErrors
  participant Stripper as stripFailedFields
  participant Recovery as applyFailTagAfterPartialSuccess / recoverFromFailedUpdate

  Processor->>Paperless: PATCH document
  Paperless-->>Processor: HTTP 400 (validation errors) / Error
  alt HTTP 400
    Processor->>Parser: parse 400 body
    Parser-->>Processor: failed scalar fields + custom_fields indices
    Processor->>Stripper: remove failed fields from payload
    Stripper-->>Processor: stripped payload
    Processor->>Paperless: retry PATCH
    Paperless-->>Processor: Success (maybe partial)
    Processor->>Recovery: applyFailTagAfterPartialSuccess (if partial)
  else Hard Error
    Processor->>Recovery: recoverFromFailedUpdate (remove auto-tag, maybe add failTag)
    Recovery->>Paperless: PATCH tags
    Paperless-->>Recovery: OK / best-effort
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • icereed/paperless-gpt#119: Related changes to UpdateDocuments tag-handling logic to preserve/merge existing tags alongside suggested tags.
  • icereed/paperless-gpt#321: Background processing refactor touching the same auto-tag flows; this PR layers partial-update detection and recovery on those paths.
  • icereed/paperless-gpt#319: Prior work on auto-tag processing and failure handling; conceptually related to re-queuing prevention.

Suggested labels

safe-to-test

Poem

🐰 I hopped through retries in the night,
Stripped the bad bits, retried just right.
When patches fail and loops ensue,
A fail-tag lands to end the queue.
Hooray for tidy, brave little bugs 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the main changes: breaking the auto-tag reprocessing loop by handling update rejections and preserving valid fields through strip-and-retry logic. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@paperless.go`:
- Around line 728-736: When you call stripFailedFields and remove rejected keys
from updatedFields (newlyDropped), also remove those same keys from
originalFields so the post-retry modification-history loop doesn't think dropped
fields were changed to nil; specifically, after computing newlyDropped and
appending to partialDroppedFields, iterate newlyDropped and delete each key from
originalFields (or keep originalFields aligned with updatedFields) so later code
that reads originalFields and writes updatedFields[field] records only real
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 98c0c319-d148-47c8-83e8-443220105ce8

📥 Commits

Reviewing files that changed from the base of the PR and between 94c8428 and 21297e6.

📒 Files selected for processing (7)
  • README.md
  • background.go
  • background_test.go
  • main.go
  • paperless.go
  • paperless_test.go
  • types.go

Comment thread paperless.go
Comment on lines +728 to +736
newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails)
if len(newlyDropped) == 0 {
	// Paperless reported errors but they don't match anything in our
	// current payload — defensive guard against parser/format drift.
	return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes))
}

partialDroppedFields = append(partialDroppedFields, newlyDropped...)
log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep originalFields synchronized when retry stripping drops fields.

At Line 728, rejected fields are removed from updatedFields, but originalFields is left unchanged. Later (Line 812+), modification history iterates originalFields and writes updatedFields[field], so dropped fields are recorded as changed to <nil> even though they were never applied.
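
The effect can be reproduced in isolation. The sketch below assumes the history loop has the shape described above (iterate `originalFields`, read the new value from `updatedFields`); `historyEntries` and the sample values are hypothetical and not taken from paperless.go.

```go
package main

import "fmt"

// historyEntries is a hypothetical reduction of the modification-history
// loop: it walks originalFields and records the value found in updatedFields.
func historyEntries(originalFields, updatedFields map[string]any) map[string]string {
	entries := make(map[string]string)
	for field, oldVal := range originalFields {
		entries[field] = fmt.Sprintf("%v -> %v", oldVal, updatedFields[field])
	}
	return entries
}

func main() {
	original := map[string]any{"title": "scan_001", "created_date": "2023-01-01"}
	updated := map[string]any{"title": "Invoice", "created_date": "2023-01-79"}

	// The server rejects created_date, so it is stripped before the retry.
	delete(updated, "created_date")
	fmt.Println(historyEntries(original, updated)["created_date"]) // 2023-01-01 -> <nil>

	// Keeping both maps aligned removes the bogus history entry.
	delete(original, "created_date")
	_, recorded := historyEntries(original, updated)["created_date"]
	fmt.Println(recorded) // false
}
```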

Suggested fix
 			newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails)
 			if len(newlyDropped) == 0 {
 				// Paperless reported errors but they don't match anything in our
 				// current payload — defensive guard against parser/format drift.
 				return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes))
 			}

 			partialDroppedFields = append(partialDroppedFields, newlyDropped...)
+			// Keep modification-history source maps aligned with fields
+			// that are still actually being patched.
+			for field := range scalarFails {
+				delete(originalFields, field)
+			}
+			if _, stillPresent := updatedFields["custom_fields"]; !stillPresent {
+				delete(originalFields, "custom_fields")
+			}
 			log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes))

The previous regex `^\d{4}-\d{2}-\d{2}$` only validates the string
format, not whether the digits form a real calendar date. It accepted
impossible values like "2023-01-79" (day 79 does not exist), which
were passed through to paperless-ngx and rejected with a 400.

Replacing the regex with time.Parse("2006-01-02", ...) catches these
values before the PATCH is sent. The field is dropped and added to
partialDroppedFields so the caller applies the fail tag — same
user-visible outcome as the strip-and-retry path for post-PATCH
rejections, but with one fewer HTTP round-trip.

Adds a test verifying that UpdateDocuments returns a PartialUpdateError
with created_date in DroppedFields when given an impossible date, and
that the PATCH payload does not include the invalid field.
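
The difference between the two checks is easy to demonstrate with the standard library alone. In this sketch, `dateShape` is the regex quoted in the commit message and `validDate` is a hypothetical wrapper around `time.Parse`; neither name comes from the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// dateShape is the format-only check quoted in the commit message.
var dateShape = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

// validDate reports whether s is a real calendar date in YYYY-MM-DD form.
// time.Parse rejects out-of-range months and days, including February 29
// in a non-leap year.
func validDate(s string) bool {
	_, err := time.Parse("2006-01-02", s)
	return err == nil
}

func main() {
	for _, s := range []string{"2023-01-19", "2023-01-79", "2023-02-29"} {
		fmt.Println(s, dateShape.MatchString(s), validDate(s))
	}
	// 2023-01-19 true true
	// 2023-01-79 true false
	// 2023-02-29 true false
}
```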