fix: extract clickable URL annotations from PDF for accurate profile detection by siddarthx07 · Pull Request #250 · interviewstreet/hiring-agent

siddarthx07 · 2026-06-24T15:00:16Z

Problem

PDFHandler uses to_markdown() (PyMuPDF) to extract resume text. This only captures visible text — it silently drops PDF link annotations. Many resumes embed hyperlinks where the display text is GitHub or LinkedIn but the actual URL lives as a hidden annotation. The LLM never sees it, so the profiles array in the extracted JSON comes back empty or wrong.

Fix

In extract_text_from_pdf(), reuse the already-open doc to call page.get_links() on each page. Collect all unique http:// / https:// URIs and append them as an explicit block:
=== CLICKABLE LINKS IN RESUME ===
https://github.com/username
https://linkedin.com/in/username

This satisfies the existing basics.jinja constraint ("ONLY extract URLs that are EXPLICITLY present in the resume markdown") with no prompt changes needed.
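The approach described above can be sketched roughly as follows. This is an illustrative sketch, not the code in the PR: the helper names `collect_http_uris` and `append_links_block` are hypothetical, and the input is assumed to be one list of link dicts per page, shaped like the output of PyMuPDF's `page.get_links()` (where URI links carry a `"uri"` key). The case-insensitive scheme check is also an assumption.

```python
LINKS_HEADER = "=== CLICKABLE LINKS IN RESUME ==="


def collect_http_uris(pages_links):
    """Return unique http(s) URIs in first-seen order.

    pages_links: iterable of per-page lists of link dicts, assumed to be
    shaped like PyMuPDF's page.get_links() output.
    """
    seen_uris = set()
    ordered = []
    for links in pages_links:
        for link in links:
            # Internal links (e.g. page jumps) have no "uri" key.
            uri = (link.get("uri") or "").strip()
            # Keep web links only; this skips mailto: and anchor links.
            if not uri.lower().startswith(("http://", "https://")):
                continue
            if uri in seen_uris:
                continue
            seen_uris.add(uri)
            ordered.append(uri)
    return ordered


def append_links_block(markdown, uris):
    """Append the explicit links block; leave the text unchanged if no links."""
    if not uris:
        return markdown
    return markdown.rstrip() + "\n\n" + LINKS_HEADER + "\n" + "\n".join(uris) + "\n"


# With an already-open PyMuPDF document `doc`, the call might look like:
#     uris = collect_http_uris(page.get_links() for page in doc)
#     text = append_links_block(markdown_text, uris)
```

Preserving first-seen order keeps the appended block stable across runs, which can make the LLM output easier to compare between versions.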

Details

No extra pymupdf.open() call — reuses the already-open doc
Deduplicates across pages via a seen_uris set
Filters to HTTP/HTTPS only (ignores mailto:, internal #anchor links)
pymupdf_rag.py is untouched

Fixes #152 (Feature Request: Extract clickable URLs from PDF text)

…detection

Many resumes embed hyperlinks where display text (e.g. "GitHub", "LinkedIn") hides the actual URL as a PDF link annotation. PyMuPDF's to_markdown() drops these annotations, so the LLM never sees the real URLs and profile extraction in the basics section is inaccurate or empty.

In extract_text_from_pdf(), reuse the already-open doc to iterate page.get_links() and collect all unique HTTP/HTTPS URIs. Append them as an explicit "=== CLICKABLE LINKS IN RESUME ===" block so the LLM treats them as explicit resume content, satisfying the existing basics.jinja constraint that only URLs present in the text should be extracted.

Fixes interviewstreet#152

Co-authored-by: Cursor <cursoragent@cursor.com>

siddarthx07 wants to merge 1 commit into interviewstreet:main from siddarthx07:fix/extract-pdf-clickable-urls
