fix(process-pdf-job): discard on RuntimeError and guard failed status by ahnv · Pull Request #2453 · we-promise/sure

ahnv · 2026-06-22T18:14:48Z

What

Two small hardening changes to ProcessPdfJob:

discard_on(RuntimeError) — permanently discards the job when a RuntimeError is raised, instead of letting Sidekiq retry it 25 times over ~21 days. The discard block logs the job_id and error message so failures stay observable without spamming retries.
Guard failed status — the early-return guard that previously only covered "complete" now also covers "failed". This prevents a re-enqueued or accidentally re-triggered job from attempting to reprocess an import that already failed.

Incident context

A ProcessPdfJob has been failing and retrying since June 7. By June 22 it had reached retry_count=22. It fails every time with:

Failed to extract text from PDF: Invalid password ()
Could not convert PDF to images

This is a password-protected PDF — a deterministic, unrecoverable error. Each retry triggers PDF→image conversion via poppler/mini_magick, a known memory-leak risk on repeated failures (image buffers are allocated and abandoned). With 3 more retries pending before Sidekiq gives up, it would have continued retrying and leaking.

Why

RuntimeError in this job is almost always deterministic (e.g. invalid state, bad PDF format, wrong password). Retrying 25 times wastes resources and delays queue throughput with no chance of success.
The poppler/mini_magick PDF→image conversion path can leak memory on each failed attempt, making repeated retries actively harmful.
Without the failed guard, a duplicate enqueue or a background retry could clobber the failed status and re-run expensive AI processing on an import whose failure was already recorded and surfaced to the user.

Retry behaviour after this change

Exception	Retries
`RuntimeError` (or subclass)	0 — discarded, error logged
`ActiveJob::DeserializationError`	0 — discarded (inherited from `ApplicationJob`)
`ActiveRecord::Deadlocked`	5 — exponential backoff (inherited from `ApplicationJob`)
Any other `StandardError`	25 — Sidekiq default

Testing

Verify a job that raises RuntimeError is discarded and not retried.
Verify a job enqueued against a failed import returns early without changing status.
Existing ProcessPdfJob tests should remain green.

Summary by CodeRabbit

Bug Fixes

PDF processing now handles errors much more gracefully: errors are now permanently resolved with detailed logging for troubleshooting purposes, preventing endless retries and reducing overall system strain and unnecessary resource usage
Enhanced processing logic to prevent unnecessary reprocessing of both completed and failed PDF imports, improving system reliability, stability and overall performance

Add discard_on(RuntimeError) to drop the job immediately instead of exhausting 25 Sidekiq retries on deterministic errors. The block logs job_id and message for observability. Widen the early-return guard from status == "complete" to also cover "failed", preventing re-processing of already-failed imports.

coderabbitai · 2026-06-22T18:15:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9001ad9-5a3a-44b2-a638-bd699c37e110

📥 Commits

Reviewing files that changed from the base of the PR and between fdcd0c7 and e8e9796.

📒 Files selected for processing (1)

app/jobs/process_pdf_job.rb

📝 Walkthrough

Walkthrough

ProcessPdfJob adds a discard_on(RuntimeError) handler that logs and permanently discards the job on runtime errors. The perform method's early-exit guard is extended from checking only "complete" status to also skipping imports with "failed" status.

Changes

PDF Job Error Handling

Layer / File(s)	Summary
RuntimeError discard handler and failed-status early exit `app/jobs/process_pdf_job.rb`	Adds `discard_on(RuntimeError)` with a logging block for permanent discard, and updates `perform`'s guard to early-return when `pdf_import.status` is in `["complete", "failed"]` instead of only `"complete"`.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 Hop hop, the job now knows to stop,
When errors rise, we log and drop!
"Complete" or "failed" — skip the fuss,
No retry loops to trouble us.
A tidy warren, clean and bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes both main changes: adding `discard_on(RuntimeError)` and expanding the early-return guard to include failed status.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

superagent-security · 2026-06-22T18:15:16Z

Superagent didn't find any vulnerabilities or security issues in this PR.

ahnv · 2026-06-22T18:16:31Z

I think @JSONbored worked on this feature, maybe your review is also beneficial on this PR

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8e9796a73

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-22T18:18:07Z

    return unless pdf_import.is_a?(PdfImport)
    return reset_processing_claim(pdf_import) unless pdf_import.pdf_uploaded?
-    return if pdf_import.status == "complete"
+    return if pdf_import.status.in?(%w[complete failed])


Keep queued retries from short-circuiting

When a non-RuntimeError is raised after processing starts, the rescue below updates the import to failed before re-raising. ActiveJob/Sidekiq then deserializes that same PdfImport on the retry, but this new guard returns immediately because the status is already failed, so transient failures such as API/Faraday errors during extraction get only the first attempt instead of the intended queued retries.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-22T18:18:07Z

 class ProcessPdfJob < ApplicationJob
  queue_as :medium_priority

+  discard_on(RuntimeError) do |job, err|


Narrow the RuntimeError discard to permanent PDF failures

This treats every bare raise "..." as non-retryable, but PdfImport#process_with_ai and #extract_transactions turn any unsuccessful provider response into exactly that by raising the response message. Since the providers also wrap API/Faraday failures into unsuccessful responses, a transient LLM outage or rate limit is now discarded after one attempt and left failed; use a narrower permanent-error class for bad PDFs/passwords.

Useful? React with 👍 / 👎.

jjmata

Targeted and clean. Two small things:

Breadth of discard_on(RuntimeError) — RuntimeError is Ruby's default error class (raise "message" produces one), so this will catch most untyped errors from PDF parsing libraries. That's probably the intent. But if any infrastructure-level error happens to be raised as a bare RuntimeError (rather than a subclass of StandardError), those would also be permanently discarded. It's worth checking what the upstream PDF/AI library raises for hard failures — if it uses a named subclass, narrowing the rescue to that class would be safer and make the discard intent more explicit.

status.in?(%w[complete failed]) guard — Good catch. This prevents a job in the failed state (set by another code path, e.g. AI processing marking it failed before the job retries) from redundantly re-processing. The original check only covered "complete", which meant a failed import could be re-attempted indefinitely if a retry was queued. This is the right fix.

Generated by Claude Code

ahnv · 2026-06-23T15:20:08Z

Hey @jjmata, based on your initial review, I dug into the actual error path and there's a subtlety worth flagging before we finalize.

The thing is, the typed error never actually makes it to the job. The provider does raise a proper named error (Provider::Openai::Error) for both hard failures - but it gets flattened to a plain string right at the PdfImport boundary:

# app/models/pdf_import.rb
unless response.success?
  error_message = response.error&.message || "Unknown PDF processing error"
  raise error_message   # raise <String> => plain RuntimeError
end

Once you raise a string, Ruby hands you back a bare RuntimeError, so all the type info is gone by the time the job catches it. Which means narrowing discard_on to Provider::Openai::Error in the job wouldn't actually do anything today - we'd have to re-raise something typed upstream first.

And that leads to the part that worries me a bit more: since every provider failure goes through that same raise error_message line, transient stuff gets lumped in with the permanent stuff. So a flaky OpenAI response ends up taking this route:

Faraday 5xx/timeout → Provider::Openai::Error → response.error → raise error_message → RuntimeError → discarded

In other words, this PR fixes the password-protected PDF case (great), but it also quietly stops retrying genuinely transient API failures, which used to get Sidekiq's 25 retries. Same fix, two very different failure modes caught in the same net.

A few ways we could go:

A - Leave it. Accept that transient API hiccups won't retry. Simplest, and honestly fine if we're treating PDF imports as best-effort that users can just re-trigger.
B - Type the boundary (my vote). Have pdf_import.rb do raise Provider::Error, error_message instead, then discard_on(Provider::Error) in the job. It's basically a two-line change, keeps retries for the transient cases, and makes the discard intent way more obvious to the next person reading it.
C - Properly split transient vs. permanent. Add a dedicated "this document can't be processed" error and only discard on that. Most correct, but also the most work for what we need right now.

On the status.in?(%w[complete failed]) guard - yeah, fully agree, that's the right call. The rescue block sets status: :failed, and the old check only looked for "complete", so a re-enqueued job would've happily re-run the whole AI pipeline on something that already failed.

I'm leaning B. Happy to fold it into this PR if that sounds right to you.

jjmata · 2026-06-25T03:41:09Z

I think B makes sense, yeah ... let's go that route! 👍

chatgpt-codex-connector Bot reviewed Jun 22, 2026

View reviewed changes

jjmata reviewed Jun 23, 2026

View reviewed changes

jjmata assigned ahnv Jun 25, 2026

jjmata added this to [v0.7.2] Current `latest` train - June 2026 Jun 25, 2026

jjmata added this to the v0.7.2 milestone Jun 25, 2026

github-project-automation Bot moved this to Todo in [v0.7.2] Current `latest` train - June 2026 Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(process-pdf-job): discard on RuntimeError and guard failed status#2453

fix(process-pdf-job): discard on RuntimeError and guard failed status#2453
ahnv wants to merge 1 commit into
we-promise:mainfrom
ahnv:fix/add-dead-letter-for-pdf-imports

ahnv commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

superagent-security Bot commented Jun 22, 2026

Uh oh!

ahnv commented Jun 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 22, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 22, 2026

Uh oh!

jjmata left a comment

Uh oh!

ahnv commented Jun 23, 2026

Uh oh!

jjmata commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ahnv commented Jun 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Incident context

Why

Retry behaviour after this change

Testing

Summary by CodeRabbit

Bug Fixes

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

superagent-security Bot commented Jun 22, 2026

Uh oh!

ahnv commented Jun 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

jjmata left a comment

Choose a reason for hiding this comment

Uh oh!

ahnv commented Jun 23, 2026

Uh oh!

jjmata commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ahnv commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading