fix: break auto-tag loop on update rejection; salvage valid fields#976
fix: break auto-tag loop on update rejection; salvage valid fields#976thu1971dlr wants to merge 2 commits into
Conversation
When paperless-ngx rejected an UpdateDocuments PATCH with a 400
(e.g. an LLM-suggested value that fails server-side validation,
such as a malformed date like "2023-01-79"), paperless-gpt would:
- Log the error
- NOT remove the auto tag
- Return, leaving the next poll cycle to re-run the full LLM
pipeline against the same document — indefinitely.
For documents tagged with paperless-gpt-auto and processed via a
paid LLM, this billed fresh LLM calls (~6 per cycle, every ~9s)
until the tag was manually removed.
This change introduces:
1. FAIL_TAG env var (default: paperless-gpt-failed). Auto-created
in paperless-ngx at startup so the user doesn't have to.
2. Strip-and-retry in UpdateDocuments: on a 400, parse the
validation response, identify the rejected fields/custom_field
entries, drop them, and retry. The valid fields land instead
of being discarded with the bad ones. Bounded to 3 retries.
3. PartialUpdateError sentinel so the caller can apply FAIL_TAG
to a document whose update partially succeeded.
4. recoverFromFailedUpdate helper: when an update cannot be
salvaged (unparseable response, tag-related errors, retries
exhausted), explicitly remove the auto tag and add FAIL_TAG
in a tag-only PATCH. The same handling applies to the OCR
auto-tag pipeline.
Adds unit tests for the validation-error parser, the strip
helper, and the loop-break recovery path.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughImplements PATCH retry that strips rejected fields on Paperless-NGX validation 400s, returns PartialUpdateError for partial successes, adds background recovery to remove auto-tags and optionally apply a fail tag, ensures fail-tag exists at startup, and documents FAIL_TAG. ChangesPartial Update Recovery Flow
Sequence Diagram(s)sequenceDiagram
participant Processor as processAutoTagDocuments
participant Paperless as PaperlessClient.UpdateDocuments
participant Parser as parsePaperlessValidationErrors
participant Stripper as stripFailedFields
participant Recovery as applyFailTagAfterPartialSuccess / recoverFromFailedUpdate
Processor->>Paperless: PATCH document
Paperless-->>Processor: HTTP 400 (validation errors) / Error
alt HTTP 400
Processor->>Parser: parse 400 body
Parser-->>Processor: failed scalar fields + custom_fields indices
Processor->>Stripper: remove failed fields from payload
Stripper-->>Processor: stripped payload
Processor->>Paperless: retry PATCH
Paperless-->>Processor: Success (maybe partial)
Processor->>Recovery: applyFailTagAfterPartialSuccess (if partial)
else Hard Error
Processor->>Recovery: recoverFromFailedUpdate (remove auto-tag, maybe add failTag)
Recovery->>Paperless: PATCH tags
Paperless-->>Recovery: OK / best-effort
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@paperless.go`:
- Around line 728-736: When you call stripFailedFields and remove rejected keys
from updatedFields (newlyDropped), also remove those same keys from
originalFields so the post-retry modification-history loop doesn't think dropped
fields were changed to nil; specifically, after computing newlyDropped and
appending to partialDroppedFields, iterate newlyDropped and delete each key from
originalFields (or keep originalFields aligned with updatedFields) so later code
that reads originalFields and writes updatedFields[field] records only real
changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 98c0c319-d148-47c8-83e8-443220105ce8
📒 Files selected for processing (7)
README.mdbackground.gobackground_test.gomain.gopaperless.gopaperless_test.gotypes.go
| newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails) | ||
| if len(newlyDropped) == 0 { | ||
| // Paperless reported errors but they don't match anything in our | ||
| // current payload — defensive guard against parser/format drift. | ||
| return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes)) | ||
| } | ||
|
|
||
| partialDroppedFields = append(partialDroppedFields, newlyDropped...) | ||
| log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes)) |
There was a problem hiding this comment.
Keep originalFields synchronized when retry stripping drops fields.
At Line 728, rejected fields are removed from updatedFields, but originalFields is left unchanged. Later (Line 812+), modification history iterates originalFields and writes updatedFields[field], so dropped fields are recorded as changed to <nil> even though they were never applied.
Suggested fix
newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails)
if len(newlyDropped) == 0 {
// Paperless reported errors but they don't match anything in our
// current payload — defensive guard against parser/format drift.
return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes))
}
partialDroppedFields = append(partialDroppedFields, newlyDropped...)
+ // Keep modification-history source maps aligned with fields
+ // that are still actually being patched.
+ for field := range scalarFails {
+ delete(originalFields, field)
+ }
+ if _, stillPresent := updatedFields["custom_fields"]; !stillPresent {
+ delete(originalFields, "custom_fields")
+ }
log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails) | |
| if len(newlyDropped) == 0 { | |
| // Paperless reported errors but they don't match anything in our | |
| // current payload — defensive guard against parser/format drift. | |
| return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes)) | |
| } | |
| partialDroppedFields = append(partialDroppedFields, newlyDropped...) | |
| log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes)) | |
| newlyDropped := stripFailedFields(updatedFields, scalarFails, cfIdxFails) | |
| if len(newlyDropped) == 0 { | |
| // Paperless reported errors but they don't match anything in our | |
| // current payload — defensive guard against parser/format drift. | |
| return fmt.Errorf("error updating document %d: %d, %s", documentID, resp.StatusCode, string(bodyBytes)) | |
| } | |
| partialDroppedFields = append(partialDroppedFields, newlyDropped...) | |
| // Keep modification-history source maps aligned with fields | |
| // that are still actually being patched. | |
| for field := range scalarFails { | |
| delete(originalFields, field) | |
| } | |
| if _, stillPresent := updatedFields["custom_fields"]; !stillPresent { | |
| delete(originalFields, "custom_fields") | |
| } | |
| log.Warnf("Document %d: paperless-ngx rejected fields %v on attempt %d/%d; retrying without them. Raw response: %s", documentID, newlyDropped, attempt+1, maxRetries+1, string(bodyBytes)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@paperless.go` around lines 728 - 736, When you call stripFailedFields and
remove rejected keys from updatedFields (newlyDropped), also remove those same
keys from originalFields so the post-retry modification-history loop doesn't
think dropped fields were changed to nil; specifically, after computing
newlyDropped and appending to partialDroppedFields, iterate newlyDropped and
delete each key from originalFields (or keep originalFields aligned with
updatedFields) so later code that reads originalFields and writes
updatedFields[field] records only real changes.
The previous regex `^\d{4}-\d{2}-\d{2}$` only validates the string
format, not whether the digits form a real calendar date. It accepted
impossible values like "2023-01-79" (day 79 does not exist), which
were passed through to paperless-ngx and rejected with a 400.
Replacing the regex with time.Parse("2006-01-02", ...) catches these
values before the PATCH is sent. The field is dropped and added to
partialDroppedFields so the caller applies the fail tag — same
user-visible outcome as the strip-and-retry path for post-PATCH
rejections, but with one fewer HTTP round-trip.
Adds a test verifying that UpdateDocuments returns a PartialUpdateError
with created_date in DroppedFields when given an impossible date, and
that the PATCH payload does not include the invalid field.
Problem
When paperless-ngx rejects an
UpdateDocumentsPATCH (HTTP 400 from its serializer), paperless-gpt does not remove the auto tag from the document. The next polling cycle picks the document up again, runs the full LLM pipeline, sends another PATCH, gets rejected again — indefinitely.Reproduction: a document whose OCR text contains a corrupted date (
Datum 79.01.2023) leads the LLM to extract2023-01-79into theFälligkeitsdatumcustom field. paperless-ngx rejects with:On a paid LLM provider the loop bills ~6 calls every ~9 seconds until the user notices and manually removes the tag.
Fix
Two layers:
Strip-and-retry. Parse the 400 body, identify which fields paperless-ngx rejected (top-level scalars + array indices into
custom_fields), remove only those, and retry the PATCH. The valid title / correspondent / tags / document_type / surviving custom fields all land. Bounded to 3 retries; in practice 1 is enough because paperless-ngx reports all errors in one body.Fail-tag marker. Whenever fields had to be dropped, apply a configurable fail tag (
FAIL_TAG, defaultpaperless-gpt-failed) so the user can find the document and complete it manually. The tag is auto-created in paperless-ngx at startup so the user doesn't need any setup.If the update can't be salvaged at all (response unparseable, errors reference fields we can't safely drop such as
tags, retry cap exhausted), the failure path is taken: remove the auto tag and add the fail tag in a tag-only PATCH. The loop is broken either way.The same handling applies to both
processAutoTagDocumentsandprocessAutoOcrTagDocuments.Test plan
parsePaperlessValidationErrorscovering real-world body, scalar-only, custom_field-only, tags-only (unrecoverable), garbage, empty response.stripFailedFieldscovering scalar drop, indexed custom_field drop, multi-index drop, full removal, out-of-range indices, no-op on absent fields.recoverFromFailedUpdatecovering happy path, emptyFAIL_TAG, recovery itself failing, OCR variant.Backwards compatibility
FAIL_TAGdefaults topaperless-gpt-failedand is auto-created at startup; existing deployments need no configuration changes. The new behaviour replaces a code path that previously caused an unbounded processing loop — strictly an improvement.Notes
Dockerfilepinsmusl-dev=1.2.5-r9which is no longer available on Alpine 3.21 (current isr11); building from this branch as-is therefore fails. That is a separate, unrelated bit-rot and should be a Renovate bump — this PR does not touch the Dockerfile.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests