fix: validate OCR file upload type and size to prevent server crashes (Fixes #741) #767

Merged
ritesh-1918 merged 2 commits into ritesh-1918:gssoc from zeroknowledge0x:fix/ocr-upload-validation-741
May 31, 2026

@zeroknowledge0x zeroknowledge0x commented May 30, 2026

Summary

Adds server-side validation for the OCR file upload endpoint to prevent crashes from processing extremely large files or non-image file types.

Changes

  • Add Pydantic field_validator for image_base64 field in TicketRequest model
  • MIME type validation: Only allows PNG, JPEG, TIFF, and PDF (when data URI prefix is present)
  • File size validation: Max 10MB decoded size
  • Descriptive error messages: Returns clear 422 validation errors for invalid uploads

Validation Details

  • Parses data URI prefix (data:image/png;base64,...) to extract and validate MIME type
  • Decodes base64 payload and checks byte length against 10MB limit
  • Gracefully handles missing data URI prefix (raw base64 without type info)
  • Invalid base64 data is caught and returns a clear error
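The checks listed above can be sketched as a plain function. This is a minimal illustration rather than the code merged in `backend/main.py`: the constant names and the exact error strings are assumptions, the Pydantic `field_validator` wiring is omitted, and the padding repair mentioned in the review walkthrough is left out.

```python
import base64
import binascii

# Assumed names; the PR only states the 10MB limit and the four allowed types.
MAX_DECODED_BYTES = 10 * 1024 * 1024
ALLOWED_MIME_TYPES = {"image/png", "image/jpeg", "image/tiff", "application/pdf"}


def validate_image_base64(value: str) -> str:
    """Return the value unchanged if it passes all checks, else raise ValueError."""
    if not value:
        return value  # no image attached: the empty string passes through

    payload = value
    if value.startswith("data:") and "," in value:
        # Split "data:image/png;base64,<payload>" into prefix and payload.
        prefix, payload = value.split(",", 1)
        mime = prefix[len("data:"):].split(";", 1)[0]
        if mime not in ALLOWED_MIME_TYPES:
            raise ValueError(
                f"Unsupported file type '{mime}'. Allowed: PNG, JPEG, TIFF, PDF"
            )

    try:
        # validate=True rejects characters outside the base64 alphabet.
        decoded = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("Invalid base64 image data") from exc

    if len(decoded) > MAX_DECODED_BYTES:
        raise ValueError("File size exceeds the 10MB limit")
    return value
```

When a function like this is registered as a Pydantic validator, a raised `ValueError` is reported as a validation error, which is what produces the 422 responses described in the Testing section.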

Testing

  • Upload valid PNG/JPEG → accepted
  • Upload 15MB image → 422 error with size message
  • Upload .exe as base64 → 422 error with type message
  • Upload invalid base64 → 422 error with "Invalid base64" message
  • No image → accepted (empty string passes through)

Related Issues

Fixes #741

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced image upload validation with stricter format and size enforcement
    • Restricted uploads to PNG, JPEG, TIFF, and PDF files
    • Enforced 10MB maximum file size limit for uploads
    • Improved error messaging for invalid submissions

- Add Pydantic field_validator for image_base64 field in TicketRequest
- Validate MIME type: only PNG, JPEG, TIFF, PDF allowed
- Validate decoded file size: max 10MB
- Return descriptive error messages for invalid uploads
- Prevents server crashes from processing extremely large or non-image files

Fixes ritesh-1918#741

vercel Bot commented May 30, 2026

@zeroknowledge0x is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai Bot commented May 30, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 275f2544-4b4b-4e74-aa41-e1028461e222

📥 Commits

Reviewing files that changed from the base of the PR and between 44b9bfd and 2cf4fa1.

📒 Files selected for processing (1)
  • backend/main.py
📝 Walkthrough

Walkthrough

The PR adds image base64 validation to TicketRequest via a new Pydantic field_validator. The validator accepts empty strings, strips data-URI prefixes, enforces allowed MIME types (PNG, JPEG, TIFF, PDF), strictly decodes base64, and rejects payloads exceeding 10MB.

Changes

Image Base64 Validation

Layer / File(s): TicketRequest image_base64 validator with MIME type and size checks (backend/main.py)
Summary: Import field_validator from Pydantic and alias base64 module. Implement validate_image_base64 validator that accepts empty values, strips optional data-URI prefixes, restricts MIME types to image/png, image/jpeg, image/tiff, and application/pdf, repairs base64 padding, strictly decodes base64, enforces 10MB decoded-size limit, and raises ValueError with specific messages for unsupported types, oversized content, or invalid encoding.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • #741: PR directly addresses missing file type and size validation in the OCR upload endpoint by adding strict MIME type enforcement and a 10MB limit to base64 image processing.

Possibly related PRs

  • ritesh-1918/HELPDESK.AI#523: Both PRs modify backend/main.py's TicketRequest.image_base64 handling using Pydantic field_validator to reject invalid and oversized base64 inputs.

Suggested labels

type:bug, level:intermediate

Poem

🐰 A file comes in, too big, too strange,
The validator stands guard with measured range—
Ten megabytes, no more it shall store,
Base64 strict, MIME types at core!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main change: implementing validation for OCR file uploads to prevent server crashes, which aligns with the core objective.
Linked Issues check | ✅ Passed | The PR implementation meets all coding requirements from issue #741: validates MIME types (PNG, JPEG, TIFF, PDF), enforces 10MB size limit, verifies decoded base64 validity, and returns descriptive 422 errors.
Out of Scope Changes check | ✅ Passed | All changes are directly related to the OCR upload validation objective; no out-of-scope modifications detected beyond the necessary field validator implementation.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/main.py (1)

106-106: ⚡ Quick win

Improve data URI parsing robustness.

The MIME extraction assumes a well-formed data URI. Consider validating the prefix format more strictly to provide clearer error messages for malformed inputs.

♻️ More robust parsing
-    mime = prefix.split(":")[1].split(";")[0] if ":" in prefix else ""
-    if mime and mime not in allowed_types:
+    # Validate data URI format: data:<mime>;base64
+    if ":" not in prefix or not prefix.startswith("data:"):
+        mime = ""
+    else:
+        mime = prefix.split(":", 1)[1].split(";")[0]
+    
+    if mime and mime not in allowed_types:
         raise ValueError(
             f"Unsupported file type '{mime}'. Allowed: PNG, JPEG, TIFF, PDF"
         )
+    elif ":" in prefix and not mime:
+        raise ValueError("Invalid data URI format: could not extract MIME type")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/main.py` at line 106, The current one-liner extracting mime from the
data URI prefix (mime = prefix.split(":")[1].split(";")[0] if ":" in prefix else
"") is fragile; update parsing in the function that handles data URIs to
validate the prefix format before splitting (e.g., ensure it starts with "data:"
and contains both ":" and ";" in the expected places or use a small regex like
^data:([^;]+);) and if invalid raise a clear ValueError/BadRequest including the
offending prefix; replace the fragile split logic with this validated extraction
so the mime variable is only set when the prefix matches the expected pattern
and provide a helpful error message when it does not.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/main.py`:
- Around line 101-110: The current check only inspects a data URI prefix (the
"," branch) so raw base64 in v bypasses MIME checks; fix by validating inputs
without a data: prefix: when "," not in v, decode the base64 into bytes and
perform magic-byte/type checks against allowed types (e.g., PNG signature 89 50
4E 47, JPEG FF D8 FF, TIFF 49 49/4D 4D, PDF "%PDF") and if the signature doesn't
match any allowed type raise the same ValueError (or a 422-equivalent) so
non-image bytes are rejected; keep the existing prefix-based MIME check for data
URIs and reuse allowed_types/mime variables for consistent error messages.
- Around line 122-126: In the except block handling Exception as e, preserve the
original exception chain by raising the new ValueError from the caught
exception: replace the final "raise ValueError(\"Invalid base64 image data\")"
with "raise ValueError(\"Invalid base64 image data\") from e"; keep the existing
bare "raise" inside the if branch that checks for "exceeds" or "Unsupported" so
the original exception is propagated unchanged (refer to the exception variable
e and the ValueError being raised).
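The magic-byte check described in the first inline comment can be sketched as a small lookup over file signatures. The names `sniff_mime` and `_SIGNATURES` are hypothetical, and the sketch uses the full PNG and TIFF signatures rather than the shortened ones quoted in the comment.

```python
from typing import Optional

# Leading bytes ("magic numbers") for the four allowed formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"%PDF", "application/pdf"),
)


def sniff_mime(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None if unrecognized."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    return None
```

A validator could call this on the decoded bytes whenever the input has no `data:` prefix and raise the same "Unsupported file type" error when the result is `None`.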

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a8adfd9-fb1d-478f-95f5-2fad317cc631

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 44b9bfd.

📒 Files selected for processing (1)
  • backend/main.py

…chain

- Validate file magic bytes (PNG, JPEG, TIFF, PDF) when base64 input lacks
  data: URI prefix, closing the MIME validation bypass gap
- Use 'raise ... from e' to preserve exception context for debugging

Addresses CodeRabbit review on PR ritesh-1918#767
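As a brief illustration of the second point, `raise ... from e` stores the original error on the new exception's `__cause__` attribute, so the traceback shows both. The helper name `decode_strict` is hypothetical and not taken from the PR.

```python
import base64
import binascii


def decode_strict(payload: str) -> bytes:
    """Decode base64 strictly, re-raising failures with the original cause attached."""
    try:
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as e:
        # "from e" keeps the low-level decoding error available for debugging.
        raise ValueError("Invalid base64 image data") from e


try:
    decode_strict("%%%")
except ValueError as err:
    # The chained exception is the original binascii.Error.
    original_cause = err.__cause__
```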
@ritesh-1918 ritesh-1918 changed the base branch from main to gssoc May 31, 2026 05:50
@ritesh-1918 added the labels gssoc (GirlScript Summer of Code), gssoc:approved (GSSoC Approved PR), level:intermediate (Intermediate level difficulty), quality:exceptional (Exceptional code quality), and type:bug (Bug fix) on May 31, 2026
@ritesh-1918
Owner

Hi @zeroknowledge0x! Thanks for the contribution. I have triaged your PR and set it to merge into the gssoc branch.

⚠️ MANDATORY GSSOC ONBOARDING STEPS:
Before your PR points are finalized on the leaderboard, you MUST complete these required steps:

  1. Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
  2. 👤 Follow the Project Admin: https://github.com/ritesh-1918 (Mandatory)
  3. 💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Welcome to the HELPDESK.AI developer family! 🚀💻

@ritesh-1918 ritesh-1918 merged commit 2cf4fa1 into ritesh-1918:gssoc May 31, 2026
9 of 10 checks passed

Development

Successfully merging this pull request may close these issues.

[Bug] OCR file upload endpoint does not validate file type or size risking server crashes

3 participants