fix: validate OCR file upload type and size to prevent server crashes (Fixes #741) by zeroknowledge0x · Pull Request #767 · ritesh-1918/HELPDESK.AI

zeroknowledge0x · 2026-05-30T14:08:53Z

Summary

Adds server-side validation for the OCR file upload endpoint to prevent crashes from processing extremely large files or non-image file types.

Changes

Add Pydantic field_validator for image_base64 field in TicketRequest model
MIME type validation: Only allows PNG, JPEG, TIFF, and PDF (when data URI prefix is present)
File size validation: Max 10MB decoded size
Descriptive error messages: Returns clear 422 validation errors for invalid uploads

Validation Details

Parses data URI prefix (data:image/png;base64,...) to extract and validate MIME type
Decodes base64 payload and checks byte length against 10MB limit
Gracefully handles missing data URI prefix (raw base64 without type info)
Invalid base64 data is caught and returns a clear error
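
The checks listed above can be sketched as a standalone function. This is a minimal illustration, not the code merged in `backend/main.py`: the names `MAX_IMAGE_BYTES` and `ALLOWED_TYPES` and the wording of the size error are assumptions, and in the PR the logic is attached to `TicketRequest.image_base64` through Pydantic's `field_validator` rather than called directly.

```python
import base64
import binascii

# Assumed constant names; the 10MB limit and the allowed types come from the PR description.
MAX_IMAGE_BYTES = 10 * 1024 * 1024
ALLOWED_TYPES = {"image/png", "image/jpeg", "image/tiff", "application/pdf"}


def validate_image_base64(v: str) -> str:
    """Return v unchanged if it passes the type and size checks, else raise ValueError."""
    if not v:
        return v  # no image attached: the empty string passes through

    prefix, sep, payload = v.partition(",")
    if sep and prefix.startswith("data:"):
        # "data:image/png;base64" -> "image/png"
        mime = prefix[len("data:"):].split(";", 1)[0]
        if mime not in ALLOWED_TYPES:
            raise ValueError(
                f"Unsupported file type '{mime}'. Allowed: PNG, JPEG, TIFF, PDF"
            )
    else:
        payload = v  # raw base64 without type info

    try:
        decoded = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as e:
        raise ValueError("Invalid base64 image data") from e

    if len(decoded) > MAX_IMAGE_BYTES:
        # Hypothetical message wording; the PR only states that a size message is returned.
        raise ValueError("Image exceeds the 10MB size limit")
    return v
```

When a function like this raises `ValueError` inside a Pydantic validator, the framework reports it as a validation error, which is how the 422 responses described in this PR would arise.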

Testing

Upload valid PNG/JPEG → accepted
Upload 15MB image → 422 error with size message
Upload .exe as base64 → 422 error with type message
Upload invalid base64 → 422 error with "Invalid base64" message
No image → accepted (empty string passes through)

Related Issues

Fixes #741

Summary by CodeRabbit

Bug Fixes
- Enhanced image upload validation with stricter format and size enforcement
- Added support for PNG, JPEG, TIFF, and PDF image formats
- Enforced 10MB maximum file size limit for uploads
- Improved error messaging for invalid submissions

- Add Pydantic field_validator for image_base64 field in TicketRequest
- Validate MIME type: only PNG, JPEG, TIFF, PDF allowed
- Validate decoded file size: max 10MB
- Return descriptive error messages for invalid uploads
- Prevents server crashes from processing extremely large or non-image files

Fixes ritesh-1918#741

vercel · 2026-05-30T14:08:58Z

@zeroknowledge0x is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-30T14:09:05Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 275f2544-4b4b-4e74-aa41-e1028461e222

📥 Commits

Reviewing files that changed from the base of the PR and between 44b9bfd and 2cf4fa1.

📒 Files selected for processing (1)

backend/main.py

📝 Walkthrough

The PR adds image base64 validation to TicketRequest via a new Pydantic field_validator. The validator accepts empty strings, strips data-URI prefixes, enforces allowed MIME types (PNG, JPEG, TIFF, PDF), strictly decodes base64, and rejects payloads exceeding 10MB.

Changes

Image Base64 Validation

| Layer / File(s) | Summary |
| --- | --- |
| TicketRequest image_base64 validator with MIME type and size checks `backend/main.py` | Import `field_validator` from Pydantic and alias the base64 module. Implement a `validate_image_base64` validator that accepts empty values, strips optional data-URI prefixes, restricts MIME types to image/png, image/jpeg, image/tiff, and application/pdf, repairs base64 padding, strictly decodes base64, enforces the 10MB decoded-size limit, and raises `ValueError` with specific messages for unsupported types, oversized content, or invalid encoding. |
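
The "repairs base64 padding, strictly decodes" step can be illustrated as follows. This is a sketch under assumptions: `decode_strict` is a hypothetical helper name, and the merged code may repair padding differently. It relies only on the standard library's `base64.b64decode` with `validate=True`, which rejects characters outside the base64 alphabet instead of silently discarding them.

```python
import base64
import binascii


def decode_strict(payload: str) -> bytes:
    """Repair missing '=' padding, then decode while rejecting non-alphabet characters."""
    payload = payload.strip()
    # Base64 text length must be a multiple of 4; add back any '=' a client stripped.
    payload += "=" * (-len(payload) % 4)
    try:
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as e:
        raise ValueError("Invalid base64 image data") from e
```

Without `validate=True`, `b64decode` skips stray characters, so garbage input could decode to unexpected bytes rather than fail.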

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

#741: PR directly addresses missing file type and size validation in the OCR upload endpoint by adding strict MIME type enforcement and a 10MB limit to base64 image processing.

Possibly related PRs

ritesh-1918/HELPDESK.AI#523: Both PRs modify backend/main.py's TicketRequest.image_base64 handling using Pydantic field_validator to reject invalid and oversized base64 inputs.

Suggested labels

type:bug, level:intermediate

Poem

🐰 A file comes in, too big, too strange,
The validator stands guard with measured range—
Ten megabytes, no more it shall store,
Base64 strict, MIME types at core! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing validation for OCR file uploads to prevent server crashes, which aligns with the core objective. |
| Linked Issues check | ✅ Passed | The PR implementation meets all coding requirements from issue `#741`: validates MIME types (PNG, JPEG, TIFF, PDF), enforces 10MB size limit, verifies decoded base64 validity, and returns descriptive 422 errors. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the OCR upload validation objective; no out-of-scope modifications detected beyond the necessary field validator implementation. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

backend/main.py (1)

106-106: ⚡ Quick win

Improve data URI parsing robustness.

The MIME extraction assumes a well-formed data URI. Consider validating the prefix format more strictly to provide clearer error messages for malformed inputs.

♻️ More robust parsing

-    mime = prefix.split(":")[1].split(";")[0] if ":" in prefix else ""
-    if mime and mime not in allowed_types:
+    # Validate data URI format: data:<mime>;base64
+    if ":" not in prefix or not prefix.startswith("data:"):
+        mime = ""
+    else:
+        mime = prefix.split(":", 1)[1].split(";")[0]
+    
+    if mime and mime not in allowed_types:
         raise ValueError(
             f"Unsupported file type '{mime}'. Allowed: PNG, JPEG, TIFF, PDF"
         )
+    elif ":" in prefix and not mime:
+        raise ValueError("Invalid data URI format: could not extract MIME type")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/main.py` at line 106, The current one-liner extracting mime from the
data URI prefix (mime = prefix.split(":")[1].split(";")[0] if ":" in prefix else
"") is fragile; update parsing in the function that handles data URIs to
validate the prefix format before splitting (e.g., ensure it starts with "data:"
and contains both ":" and ";" in the expected places or use a small regex like
^data:([^;]+);) and if invalid raise a clear ValueError/BadRequest including the
offending prefix; replace the fragile split logic with this validated extraction
so the mime variable is only set when the prefix matches the expected pattern
and provide a helpful error message when it does not.
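
The regex-based extraction suggested in this comment could look like the sketch below. The helper name `extract_mime` and the exact wording of the "Invalid data URI prefix" message are assumptions; the pattern is the `^data:([^;]+);` form quoted above, applied to the text before the first comma.

```python
import re

ALLOWED_TYPES = {"image/png", "image/jpeg", "image/tiff", "application/pdf"}

# Pattern quoted in the review comment: capture everything between "data:" and the first ";".
_DATA_URI_PREFIX = re.compile(r"^data:([^;]+);")


def extract_mime(prefix: str) -> str:
    """Return the MIME type from a data-URI prefix such as 'data:image/png;base64'."""
    match = _DATA_URI_PREFIX.match(prefix)
    if match is None:
        # Truncate the echoed prefix so a huge payload cannot bloat the error message.
        raise ValueError(f"Invalid data URI prefix: {prefix[:40]!r}")
    mime = match.group(1)
    if mime not in ALLOWED_TYPES:
        raise ValueError(
            f"Unsupported file type '{mime}'. Allowed: PNG, JPEG, TIFF, PDF"
        )
    return mime
```

With this shape, `mime` is only ever set when the prefix matches the expected pattern, so a malformed prefix can no longer fall through as an empty string.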

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/main.py`:
- Around line 101-110: The current check only inspects a data URI prefix (the
"," branch) so raw base64 in v bypasses MIME checks; fix by validating inputs
without a data: prefix: when "," not in v, decode the base64 into bytes and
perform magic-byte/type checks against allowed types (e.g., PNG signature 89 50
4E 47, JPEG FF D8 FF, TIFF 49 49/4D 4D, PDF "%PDF") and if the signature doesn't
match any allowed type raise the same ValueError (or a 422-equivalent) so
non-image bytes are rejected; keep the existing prefix-based MIME check for data
URIs and reuse allowed_types/mime variables for consistent error messages.
- Around line 122-126: In the except block handling Exception as e, preserve the
original exception chain by raising the new ValueError from the caught
exception: replace the final "raise ValueError(\"Invalid base64 image data\")"
with "raise ValueError(\"Invalid base64 image data\") from e"; keep the existing
bare "raise" inside the if branch that checks for "exceeds" or "Unsupported" so
the original exception is propagated unchanged (refer to the exception variable
e and the ValueError being raised).
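
The magic-byte check described in the first inline comment can be sketched like this. `MAGIC_SIGNATURES` and `sniff_type` are hypothetical names, not identifiers from the PR. The PNG, JPEG, and PDF signatures are the ones listed in the comment; for TIFF the sketch uses the full 4-byte headers (`II*\x00` little-endian, `MM\x00*` big-endian) instead of the 2-byte `49 49` / `4D 4D` prefixes, which is a slightly stricter choice made here for illustration.

```python
# Leading bytes that identify each allowed type.
MAGIC_SIGNATURES = {
    "image/png": (b"\x89PNG",),
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/tiff": (b"II*\x00", b"MM\x00*"),
    "application/pdf": (b"%PDF",),
}


def sniff_type(data: bytes) -> str:
    """Return the MIME type implied by the leading bytes, or raise ValueError."""
    for mime, signatures in MAGIC_SIGNATURES.items():
        # bytes.startswith accepts a tuple and matches if any entry is a prefix.
        if data.startswith(signatures):
            return mime
    raise ValueError("Unsupported file type. Allowed: PNG, JPEG, TIFF, PDF")
```

A validator would call this on the decoded bytes whenever the input has no `data:` prefix, so a raw base64 `.exe` (which starts with `MZ`) is rejected in the same way as one with an explicit MIME type.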

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a8adfd9-fb1d-478f-95f5-2fad317cc631

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 44b9bfd.

📒 Files selected for processing (1)

backend/main.py

…chain
- Validate file magic bytes (PNG, JPEG, TIFF, PDF) when base64 input lacks data: URI prefix, closing the MIME validation bypass gap
- Use 'raise ... from e' to preserve exception context for debugging

Addresses CodeRabbit review on PR ritesh-1918#767
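
The exception-handling shape described in the review (re-raise the validator's own errors unchanged, wrap everything else with `from e`) might be sketched as below. The function name `decode_checked`, the `max_bytes` parameter, and the size message are assumptions for illustration; only the "Invalid base64 image data" message and the "exceeds"/"Unsupported" substring test come from the review text.

```python
import base64


def decode_checked(payload: str, max_bytes: int) -> bytes:
    """Decode payload, keeping our own errors intact and chaining unexpected ones."""
    try:
        decoded = base64.b64decode(payload, validate=True)
        if len(decoded) > max_bytes:
            # Hypothetical wording; it only needs to contain "exceeds" for the check below.
            raise ValueError(f"File size exceeds {max_bytes} bytes")
        return decoded
    except Exception as e:
        if "exceeds" in str(e) or "Unsupported" in str(e):
            raise  # our own validation error: propagate unchanged
        # 'from e' stores the original error on __cause__ so tracebacks show both.
        raise ValueError("Invalid base64 image data") from e
```

With `from e`, the traceback of the generic error still shows the underlying `binascii.Error`, which is the debugging context the commit message refers to.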

ritesh-1918 · 2026-05-31T05:50:54Z

Hi @zeroknowledge0x! Thanks for the contribution. I have triaged your PR and set it to merge into the gssoc branch.

⚠️ MANDATORY GSSOC ONBOARDING STEPS:
Before your PR points are finalized on the leaderboard, you MUST complete these required steps:

⭐ Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
👤 Follow the Project Admin: https://github.com/ritesh-1918 (Mandatory)
💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Welcome to the HELPDESK.AI developer family! 🚀💻

coderabbitai Bot reviewed May 30, 2026

ritesh-1918 changed the base branch from main to gssoc May 31, 2026 05:50

ritesh-1918 added the labels gssoc (GirlScript Summer of Code), gssoc:approved (GSSoC Approved PR), level:intermediate (Intermediate level difficulty), quality:exceptional (Exceptional code quality), and type:bug (Bug fix) on May 31, 2026

ritesh-1918 merged commit 2cf4fa1 into ritesh-1918:gssoc May 31, 2026
9 of 10 checks passed

ritesh-1918 merged 2 commits into ritesh-1918:gssoc from zeroknowledge0x:fix/ocr-upload-validation-741
