fix: extract clickable URL annotations from PDF for accurate profile detection #250

Open
siddarthx07 wants to merge 1 commit into interviewstreet:main from siddarthx07:fix/extract-pdf-clickable-urls

Problem

PDFHandler uses to_markdown() (PyMuPDF) to extract resume text. This only captures visible text — it silently drops PDF link annotations. Many resumes embed hyperlinks where the display text is GitHub or LinkedIn but the actual URL lives as a hidden annotation. The LLM never sees it, so the profiles array in the extracted JSON comes back empty or wrong.

Fix

In extract_text_from_pdf(), reuse the already-open doc to call page.get_links() on each page. Collect all unique http:// / https:// URIs and append them as an explicit block:
=== CLICKABLE LINKS IN RESUME ===
https://github.com/username
https://linkedin.com/in/username

This satisfies the existing basics.jinja constraint ("ONLY extract URLs that are EXPLICITLY present in the resume markdown") with no prompt changes needed.
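
For readers who want to see the shape of the change, here is a minimal sketch of the approach described above. It is not the code from this PR: the helper names `collect_clickable_urls` and `append_links_block` are hypothetical, and the first-seen ordering of the deduplicated URLs is an assumption. It relies only on the fact that a PyMuPDF document can be iterated page by page and that `page.get_links()` returns a list of dicts in which URI links carry a `"uri"` key.

```python
from typing import Iterable, List

LINKS_HEADER = "=== CLICKABLE LINKS IN RESUME ==="


def collect_clickable_urls(pages: Iterable) -> List[str]:
    """Return unique http(s) URIs from link annotations, in first-seen order.

    `pages` is any iterable of objects exposing get_links(), such as an
    open PyMuPDF document. (Hypothetical helper, not the PR's function.)
    """
    seen = set()
    urls: List[str] = []
    for page in pages:
        # Each link is a dict; internal "goto" links have no "uri" key.
        for link in page.get_links():
            uri = (link.get("uri") or "").strip()
            is_web = uri.lower().startswith(("http://", "https://"))
            if is_web and uri not in seen:
                seen.add(uri)
                urls.append(uri)
    return urls


def append_links_block(markdown: str, urls: List[str]) -> str:
    """Append the explicit links block; leave the text untouched if no URLs."""
    if not urls:
        return markdown
    return markdown.rstrip() + "\n\n" + LINKS_HEADER + "\n" + "\n".join(urls) + "\n"
```

Inside `extract_text_from_pdf()`, the already-open `doc` would presumably be passed as `pages`, and the result appended to the `to_markdown()` output before it is handed to the LLM. Filtering on the scheme keeps `mailto:` and internal page links out of the block.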

Details

…detection

Many resumes embed hyperlinks where display text (e.g. "GitHub", "LinkedIn")
hides the actual URL as a PDF link annotation. PyMuPDF's to_markdown() drops
these annotations, so the LLM never sees the real URLs and profile extraction
in the basics section is inaccurate or empty.

In extract_text_from_pdf(), reuse the already-open doc to iterate page.get_links()
and collect all unique HTTP/HTTPS URIs. Append them as an explicit
"=== CLICKABLE LINKS IN RESUME ===" block so the LLM treats them as
explicit resume content — satisfying the existing basics.jinja constraint
that only URLs present in the text should be extracted.

Fixes interviewstreet#152

Co-authored-by: Cursor <cursoragent@cursor.com>

Development

Successfully merging this pull request may close these issues.

Feature Request: Extract clickable URLs from PDF text
