Skip to content

fix(process-pdf-job): discard on RuntimeError and guard failed status#2453

Open
ahnv wants to merge 1 commit into
we-promise:mainfrom
ahnv:fix/add-dead-letter-for-pdf-imports
Open

fix(process-pdf-job): discard on RuntimeError and guard failed status#2453
ahnv wants to merge 1 commit into
we-promise:mainfrom
ahnv:fix/add-dead-letter-for-pdf-imports

Conversation

@ahnv

@ahnv ahnv commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What

Two small hardening changes to ProcessPdfJob:

  1. discard_on(RuntimeError) — permanently discards the job when a RuntimeError is raised, instead of letting Sidekiq retry it 25 times over ~21 days. The discard block logs the job_id and error message so failures stay observable without spamming retries.

  2. Guard failed status — the early-return guard that previously only covered "complete" now also covers "failed". This prevents a re-enqueued or accidentally re-triggered job from attempting to reprocess an import that already failed.

Incident context

A ProcessPdfJob has been failing and retrying since June 7. By June 22 it had reached retry_count=22. It fails every time with:

Failed to extract text from PDF: Invalid password ()
Could not convert PDF to images

This is a password-protected PDF — a deterministic, unrecoverable error. Each retry triggers PDF→image conversion via poppler/mini_magick, a known memory-leak risk on repeated failures (image buffers are allocated and abandoned). With 3 more retries pending before Sidekiq gives up, it would have continued retrying and leaking.

Why

  • RuntimeError in this job is almost always deterministic (e.g. invalid state, bad PDF format, wrong password). Retrying 25 times wastes resources and delays queue throughput with no chance of success.
  • The poppler/mini_magick PDF→image conversion path can leak memory on each failed attempt, making repeated retries actively harmful.
  • Without the failed guard, a duplicate enqueue or a background retry could clobber the failed status and re-run expensive AI processing on an import whose failure was already recorded and surfaced to the user.

Retry behaviour after this change

Exception Retries
RuntimeError (or subclass) 0 — discarded, error logged
ActiveJob::DeserializationError 0 — discarded (inherited from ApplicationJob)
ActiveRecord::Deadlocked 5 — exponential backoff (inherited from ApplicationJob)
Any other StandardError 25 — Sidekiq default

Testing

  • Verify a job that raises RuntimeError is discarded and not retried.
  • Verify a job enqueued against a failed import returns early without changing status.
  • Existing ProcessPdfJob tests should remain green.

Summary by CodeRabbit

Bug Fixes

  • PDF processing now handles errors much more gracefully: errors are now permanently resolved with detailed logging for troubleshooting purposes, preventing endless retries and reducing overall system strain and unnecessary resource usage
  • Enhanced processing logic to prevent unnecessary reprocessing of both completed and failed PDF imports, improving system reliability, stability and overall performance

Add discard_on(RuntimeError) to drop the job immediately instead of
exhausting 25 Sidekiq retries on deterministic errors. The block logs
job_id and message for observability.

Widen the early-return guard from status == "complete" to also cover
"failed", preventing re-processing of already-failed imports.
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9001ad9-5a3a-44b2-a638-bd699c37e110

📥 Commits

Reviewing files that changed from the base of the PR and between fdcd0c7 and e8e9796.

📒 Files selected for processing (1)
  • app/jobs/process_pdf_job.rb

📝 Walkthrough

Walkthrough

ProcessPdfJob adds a discard_on(RuntimeError) handler that logs and permanently discards the job on runtime errors. The perform method's early-exit guard is extended from checking only "complete" status to also skipping imports with "failed" status.

Changes

PDF Job Error Handling

Layer / File(s) Summary
RuntimeError discard handler and failed-status early exit
app/jobs/process_pdf_job.rb
Adds discard_on(RuntimeError) with a logging block for permanent discard, and updates perform's guard to early-return when pdf_import.status is in ["complete", "failed"] instead of only "complete".

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 Hop hop, the job now knows to stop,
When errors rise, we log and drop!
"Complete" or "failed" — skip the fuss,
No retry loops to trouble us.
A tidy warren, clean and bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes both main changes: adding discard_on(RuntimeError) and expanding the early-return guard to include failed status.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@superagent-security

Copy link
Copy Markdown

Superagent didn't find any vulnerabilities or security issues in this PR.

@ahnv

ahnv commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

I think @JSONbored worked on this feature, maybe your review is also beneficial on this PR

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8e9796a73

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return unless pdf_import.is_a?(PdfImport)
return reset_processing_claim(pdf_import) unless pdf_import.pdf_uploaded?
return if pdf_import.status == "complete"
return if pdf_import.status.in?(%w[complete failed])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep queued retries from short-circuiting

When a non-RuntimeError is raised after processing starts, the rescue below updates the import to failed before re-raising. ActiveJob/Sidekiq then deserializes that same PdfImport on the retry, but this new guard returns immediately because the status is already failed, so transient failures such as API/Faraday errors during extraction get only the first attempt instead of the intended queued retries.

Useful? React with 👍 / 👎.

class ProcessPdfJob < ApplicationJob
queue_as :medium_priority

discard_on(RuntimeError) do |job, err|

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Narrow the RuntimeError discard to permanent PDF failures

This treats every bare raise "..." as non-retryable, but PdfImport#process_with_ai and #extract_transactions turn any unsuccessful provider response into exactly that by raising the response message. Since the providers also wrap API/Faraday failures into unsuccessful responses, a transient LLM outage or rate limit is now discarded after one attempt and left failed; use a narrower permanent-error class for bad PDFs/passwords.

Useful? React with 👍 / 👎.

@jjmata jjmata left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targeted and clean. Two small things:

Breadth of discard_on(RuntimeError)RuntimeError is Ruby's default error class (raise "message" produces one), so this will catch most untyped errors from PDF parsing libraries. That's probably the intent. But if any infrastructure-level error happens to be raised as a bare RuntimeError (rather than a subclass of StandardError), those would also be permanently discarded. It's worth checking what the upstream PDF/AI library raises for hard failures — if it uses a named subclass, narrowing the rescue to that class would be safer and make the discard intent more explicit.

status.in?(%w[complete failed]) guard — Good catch. This prevents a job in the failed state (set by another code path, e.g. AI processing marking it failed before the job retries) from redundantly re-processing. The original check only covered "complete", which meant a failed import could be re-attempted indefinitely if a retry was queued. This is the right fix.


Generated by Claude Code

@ahnv

ahnv commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Hey @jjmata, based on your initial review, I dug into the actual error path and there's a subtlety worth flagging before we finalize.

The thing is, the typed error never actually makes it to the job. The provider does raise a proper named error (Provider::Openai::Error) for both hard failures - but it gets flattened to a plain string right at the PdfImport boundary:

# app/models/pdf_import.rb
unless response.success?
  error_message = response.error&.message || "Unknown PDF processing error"
  raise error_message   # raise <String> => plain RuntimeError
end

Once you raise a string, Ruby hands you back a bare RuntimeError, so all the type info is gone by the time the job catches it. Which means narrowing discard_on to Provider::Openai::Error in the job wouldn't actually do anything today - we'd have to re-raise something typed upstream first.

And that leads to the part that worries me a bit more: since every provider failure goes through that same raise error_message line, transient stuff gets lumped in with the permanent stuff. So a flaky OpenAI response ends up taking this route:

Faraday 5xx/timeoutProvider::Openai::Errorresponse.errorraise error_messageRuntimeErrordiscarded

In other words, this PR fixes the password-protected PDF case (great), but it also quietly stops retrying genuinely transient API failures, which used to get Sidekiq's 25 retries. Same fix, two very different failure modes caught in the same net.

A few ways we could go:

  • A - Leave it. Accept that transient API hiccups won't retry. Simplest, and honestly fine if we're treating PDF imports as best-effort that users can just re-trigger.
  • B - Type the boundary (my vote). Have pdf_import.rb do raise Provider::Error, error_message instead, then discard_on(Provider::Error) in the job. It's basically a two-line change, keeps retries for the transient cases, and makes the discard intent way more obvious to the next person reading it.
  • C - Properly split transient vs. permanent. Add a dedicated "this document can't be processed" error and only discard on that. Most correct, but also the most work for what we need right now.

On the status.in?(%w[complete failed]) guard - yeah, fully agree, that's the right call. The rescue block sets status: :failed, and the old check only looked for "complete", so a re-enqueued job would've happily re-run the whole AI pipeline on something that already failed.

I'm leaning B. Happy to fold it into this PR if that sounds right to you.

@jjmata

jjmata commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

I think B makes sense, yeah ... let's go that route! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants