Summary
paperless-gpt's auto-tagging path (processAutoTagDocuments in the background task) forwards every LLM-suggested value to Paperless via PATCH without intermediate validation. If Paperless rejects the request with a 4xx, for any reason, the auto-tag path exits with an error and leaves AUTO_TAG (paperless-gpt-auto) on the document. The background poller picks the document up again ~9 s later, re-runs the full ~21 s LLM pipeline (6 calls on OpenAI), gets the same rejection, and loops indefinitely.
This is the LLM-side equivalent of #949 (which addresses the same shape on the OCR path). The defect is architectural and type-agnostic: any Paperless-validated field type can trigger it (Date, Select, Monetary, Integer, Float, URL, Boolean, oversize String). Related issues that are the same underlying bug under different triggers:
(this issue) — auto-tagging path, Date-type custom field with out-of-range value from LLM
Environment
paperless-gpt: v0.25.1 (image: ghcr.io/icereed/paperless-gpt:latest, built 2026-02-26)
paperless-ngx: 2.20 (latest)
LLM provider: OpenAI, model gpt-5.4-mini, LLM_TEMPERATURE=1.0
AUTO_TAG: paperless-gpt-auto (default)
Polling interval: ~9 s (default)
Reproducer (the simplest of many)
In Paperless, create a custom field of type Date and include its id in paperless-gpt's custom_fields_selected_ids so the LLM tries to fill it.
Add a document whose OCR text contains a date with an out-of-range day or month — for example the literal string Datum 79.01.2023 (easy to provoke by hand-typing such a string into a test PDF before submitting).
Add paperless-gpt-auto to the document.
Other validated field types reproduce the same loop with different triggers — e.g. #956 demonstrates it for type Select with a freetext LLM output. A Monetary field fed a non-numeric value, an Integer field fed a string, a URL field fed something without a scheme, etc., would all behave the same way.
Observed
LLM emits a value that Paperless's serializer rejects (in the reproducer: the ISO-formatted but invalid string 2023-01-79, i.e. day 79).
paperless-gpt PATCHes Paperless. Paperless replies 400, e.g. (for the Date case):
400 {"custom_fields":[...,{"non_field_errors":[
"Date has wrong format. Use one of these formats instead: YYYY-MM-DD."]},...]}
paperless-gpt logs:
level=error msg="Error in background tagging: error in processAutoTagDocuments:
error updating document N: ..."
AUTO_TAG is NOT removed. Next poll re-runs the full LLM pipeline.
Paperless's audit log confirms a 94-minute observed loop on a single document in our environment: ~270 cycles × ~6 LLM calls ≈ ~1,600 billed calls.
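The estimate above is plain arithmetic. The following Go sketch reproduces it from the approximate figures reported here (about 270 cycles, about 6 calls per cycle, 94 minutes). The function names are illustrative only and do not exist in paperless-gpt.

```go
package main

import "fmt"

// billedCalls returns the total number of LLM calls made by a retry loop
// that runs `cycles` times with `callsPerCycle` calls in each cycle.
func billedCalls(cycles, callsPerCycle int) int {
	return cycles * callsPerCycle
}

// secondsPerCycle returns the average wall-clock duration of one cycle,
// given the total loop duration in minutes and the number of cycles.
func secondsPerCycle(loopMinutes float64, cycles int) float64 {
	return loopMinutes * 60 / float64(cycles)
}

func main() {
	fmt.Println(billedCalls(270, 6))               // 1620
	fmt.Printf("%.1f\n", secondsPerCycle(94, 270)) // 20.9
}
```

With these inputs the loop costs roughly 1,600 calls for a single document, and it keeps growing linearly for as long as the tag stays on the document.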
Expected
As with #949 on the OCR path: on any error exit from processAutoTagDocuments, regardless of where it failed (LLM call, JSON parse, Paperless PATCH 4xx, etc.), AUTO_TAG should be swapped for a configurable failure tag (default paperless-gpt-failed). This breaks the loop after one wasted cycle and makes failed documents easy for the user to find and re-process manually.
Suggested fix
Mirror #949's pattern in the LLM auto-tagging path. The fix is structural and type-agnostic — it does not need to know which field validation tripped; it only needs to react to any error exit from processAutoTagDocuments:
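Add a FAIL_TAG env var (default: paperless-gpt-failed), validated and exported.
In background.go (or wherever processAutoTagDocuments exits on error), call UpdateDocuments to swap AUTO_TAG → FAIL_TAG before continuing.
Add a test (cf. TestProcessAutoOcrTagDocuments_FailureRemovesTag) for the AUTO_TAG path; include at least one case per failure class (PATCH 4xx, LLM error, JSON parse error).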
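The steps above could look roughly like the following. This is a minimal, self-contained sketch, not paperless-gpt's actual code: the Document type, failTagFromEnv, swapTag, processOne, and errRejected are all hypothetical names, and the swap happens in memory only. In the real fix it would have to be persisted through UpdateDocuments.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Document is a hypothetical, minimal stand-in for a Paperless document.
type Document struct {
	ID   int
	Tags []string
}

// errRejected simulates Paperless answering the PATCH with a 4xx.
var errRejected = errors.New("paperless rejected the update (simulated 4xx)")

// failTagFromEnv reads FAIL_TAG and falls back to the default proposed in
// this issue when the variable is unset or empty.
func failTagFromEnv() string {
	if v := os.Getenv("FAIL_TAG"); v != "" {
		return v
	}
	return "paperless-gpt-failed"
}

// swapTag returns a copy of tags with every occurrence of `from` removed
// and `to` present exactly once.
func swapTag(tags []string, from, to string) []string {
	out := make([]string, 0, len(tags)+1)
	hasTo := false
	for _, t := range tags {
		if t == from {
			continue
		}
		if t == to {
			hasTo = true
		}
		out = append(out, t)
	}
	if !hasTo {
		out = append(out, to)
	}
	return out
}

// processOne runs tagFn (standing in for the LLM call plus PATCH) and, on
// ANY error, swaps autoTag for failTag so the poller no longer selects the
// document. The success path is left to tagFn and is out of scope here.
func processOne(doc *Document, autoTag, failTag string, tagFn func(*Document) error) error {
	if err := tagFn(doc); err != nil {
		doc.Tags = swapTag(doc.Tags, autoTag, failTag)
		return fmt.Errorf("document %d: %w", doc.ID, err)
	}
	return nil
}

func main() {
	doc := &Document{ID: 1, Tags: []string{"invoice", "paperless-gpt-auto"}}
	err := processOne(doc, "paperless-gpt-auto", failTagFromEnv(),
		func(*Document) error { return errRejected })
	fmt.Println(err)
	fmt.Println(doc.Tags)
}
```

The point of the design is that processOne never inspects the error: whichever failure class occurred, the document leaves the AUTO_TAG set after a single attempt.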
A complementary improvement (separate PR; out of scope for this one) would be to validate LLM output against the destination Paperless schema before PATCHing, and drop fields that won't pass validation rather than discarding the whole document update over one bad value. But the loop-break is the essential fix.
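As a rough illustration of that complementary idea, the sketch below drops individual values that fail a cheap local check and keeps the rest. It is an assumption-laden example: the Field type, the type labels ("date", "integer", "url"), and both function names are hypothetical, and the checks only approximate what Paperless's serializer enforces. Paperless remains the authority on what is valid.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// Field is a hypothetical representation of one LLM-suggested custom field.
type Field struct {
	ID    int
	Type  string // hypothetical labels: "date", "integer", "url"
	Value string
}

// validateValue performs a cheap local plausibility check for a few types.
// Unknown types pass through unchanged and are left for the server to judge.
func validateValue(fieldType, raw string) error {
	switch fieldType {
	case "date":
		// Rejects out-of-range days and months such as 2023-01-79.
		_, err := time.Parse("2006-01-02", raw)
		return err
	case "integer":
		_, err := strconv.Atoi(raw)
		return err
	case "url":
		u, err := url.Parse(raw)
		if err != nil {
			return err
		}
		if u.Scheme == "" || u.Host == "" {
			return fmt.Errorf("url %q lacks a scheme or host", raw)
		}
		return nil
	default:
		return nil
	}
}

// filterValid splits fields into those that pass validateValue and those
// that do not, preserving input order in both results.
func filterValid(fields []Field) (kept, dropped []Field) {
	for _, f := range fields {
		if err := validateValue(f.Type, f.Value); err != nil {
			dropped = append(dropped, f)
			continue
		}
		kept = append(kept, f)
	}
	return kept, dropped
}

func main() {
	kept, dropped := filterValid([]Field{
		{ID: 1, Type: "date", Value: "2023-01-79"},
		{ID: 2, Type: "date", Value: "2023-01-19"},
		{ID: 3, Type: "url", Value: "example.com"},
		{ID: 4, Type: "integer", Value: "42"},
	})
	fmt.Println("kept:", kept)
	fmt.Println("dropped:", dropped)
}
```

Only the kept fields would go into the PATCH. The dropped ones could be logged, so that one bad value costs a single field rather than the whole update.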
Happy to submit a PR if useful. Either way, I'd keep this fix separate from #944 (queue rewrite) since it's a small, focused change that can land independently.