feat(openalex): add OpenAlex API integration #30

Open
hongkongkiwi wants to merge 1 commit into openags:main from hongkongkiwi:feature/openalex-integration

@hongkongkiwi

OpenAlex integration with 200M+ works, citations, author search. Adds search_openalex, get_openalex_paper, get_openalex_citations, get_openalex_references, search_openalex_by_author, get_openalex_related, download_openalex, read_openalex_paper tools.

…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works.
This integration provides:

- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF

Copilot AI left a comment

Pull request overview

Adds OpenAlex as an additional academic source, exposing MCP tools for searching works and traversing citation/references/related-work graphs, plus optional PDF download/read support via open-access locations.

Changes:

  • Introduces OpenAlexSearcher implementation backed by the OpenAlex REST API.
  • Wires new OpenAlex MCP tools into paper_search_mcp/server.py (search, lookup, citations, references, author search, related, download, read).
  • Adds OpenAlex-specific parsing for works into the project’s standardized Paper format.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| paper_search_mcp/server.py | Registers OpenAlex searcher instance and exposes OpenAlex MCP tools. |
| paper_search_mcp/academic_platforms/openalex.py | Implements OpenAlex API calls, result parsing, and PDF download/read helpers. |

Comment on lines +75 to +76
# Year range
params["filter"] = f"from_publication_date:{year}"

Copilot AI Jan 23, 2026

Year-range handling sets filter = from_publication_date:{year} when year is in the YYYY-YYYY form (per docstring/examples). from_publication_date is a date field, so this produces an invalid/ambiguous filter. Convert ranges to proper date filters (e.g., from_publication_date:YYYY-01-01,to_publication_date:YYYY-12-31) or use publication_year constraints.

Suggested change
- # Year range
- params["filter"] = f"from_publication_date:{year}"
+ # Year range, expected format: 'YYYY-YYYY'
+ start_year, end_year = [part.strip() for part in year.split("-", 1)]
+ if (
+     len(start_year) == 4
+     and len(end_year) == 4
+     and start_year.isdigit()
+     and end_year.isdigit()
+ ):
+     params["filter"] = (
+         f"from_publication_date:{start_year}-01-01,"
+         f"to_publication_date:{end_year}-12-31"
+     )
+ else:
+     # Fallback: treat as a single publication_year filter
+     params["filter"] = f"publication_year:{year}"
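
The range handling could also live in a small helper, which keeps the search method short and makes the parsing easy to test. A minimal sketch, assuming `year` arrives as 'YYYY' or 'YYYY-YYYY'; the open-ended forms 'YYYY-' and '-YYYY' and the name `build_year_filter` are assumptions for illustration, not part of the PR:

```python
import re
from typing import Optional

_YEAR = re.compile(r"^\d{4}$")


def build_year_filter(year: str) -> Optional[str]:
    """Translate a year or year range into an OpenAlex filter fragment.

    Hypothetical helper. Returns None when the input is not recognised,
    so the caller can decide whether to skip the filter or raise.
    """
    year = year.strip()
    if _YEAR.match(year):
        return f"publication_year:{year}"
    if "-" not in year:
        return None

    start, end = (part.strip() for part in year.split("-", 1))
    # Each side may be empty (open-ended range) but must otherwise be 4 digits.
    if start and not _YEAR.match(start):
        return None
    if end and not _YEAR.match(end):
        return None

    parts = []
    if start:
        parts.append(f"from_publication_date:{start}-01-01")
    if end:
        parts.append(f"to_publication_date:{end}-12-31")
    return ",".join(parts) or None
```

Returning None instead of falling back to `publication_year:{year}` avoids sending a malformed filter when the input is neither a year nor a range.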

Comment on lines +216 to +220
params = {
    "filter": f"referenced_by:{openalex_id}",
    "per-page": max_results,
    "mailto": self.EMAIL_PARAM
}

Copilot AI Jan 23, 2026

get_references uses filter=referenced_by:{openalex_id}, which reads as “works that reference this work” (forward citations), not “works referenced by this work”. To return actual references, fetch the work first, read its referenced_works IDs, then bulk-fetch those works’ metadata. Also consider clamping per-page to 200 to match OpenAlex’s max page size.
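
The two-step approach described here (fetch the work, then bulk-fetch its `referenced_works`) might look roughly like the sketch below. Several things are assumptions: `get_json` is a hypothetical transport hook standing in for the searcher's HTTP call, the `ids.openalex` filter name and the cap of 50 OR-ed values per filter should be checked against the OpenAlex documentation, and `fetch_references` is an illustrative name only. OpenAlex's filter documentation also lists a `cited_by` filter for outgoing citations, which may make the two-step fetch unnecessary; that is worth verifying before relying on it.

```python
from typing import Callable, Dict, List

MAX_PER_PAGE = 200  # page-size cap cited in the review comment
MAX_OR_VALUES = 50  # assumed cap on OR-ed values per filter; verify against the docs


def bare_id(openalex_id: str) -> str:
    """Reduce 'https://openalex.org/W123' to 'W123'."""
    return openalex_id.rstrip("/").split("/")[-1]


def fetch_references(
    work_id: str,
    get_json: Callable[[str, Dict[str, str]], dict],
    max_results: int = 25,
) -> List[dict]:
    """Return raw work records for the works that `work_id` references.

    `get_json(path, params)` is a hypothetical hook: it should perform the
    GET against the API base URL and return the decoded JSON body.
    """
    work = get_json(f"/works/{bare_id(work_id)}", {})
    ref_ids = [bare_id(ref) for ref in work.get("referenced_works") or []]
    ref_ids = ref_ids[: max(0, max_results)]

    results: List[dict] = []
    # Fetch the referenced works in batches small enough for one OR filter.
    for start in range(0, len(ref_ids), MAX_OR_VALUES):
        batch = ref_ids[start : start + MAX_OR_VALUES]
        page = get_json(
            "/works",
            {
                "filter": "ids.openalex:" + "|".join(batch),
                "per-page": str(min(len(batch), MAX_PER_PAGE)),
            },
        )
        results.extend(page.get("results") or [])
    return results
```

Injecting the transport keeps the ID handling and batching testable without network access.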

Comment on lines +309 to +326
    openalex_id: OpenAlex ID
    max_results: Maximum number of related papers to return

Returns:
    List of related Paper objects
"""
if openalex_id.startswith("http"):
    openalex_id = openalex_id.split("/")[-1]

url = f"{self.BASE_URL}/works"
params = {
    "filter": f"has_concepts:{openalex_id}",
    "per-page": max_results,
    "sort": "cited_by_count:desc",
    "mailto": self.EMAIL_PARAM
}

try:

Copilot AI Jan 23, 2026

get_related_papers uses filter=has_concepts:{openalex_id}, but openalex_id here is a work ID (W...), not a concept ID. This is likely to return incorrect/empty results. Fetch the work, extract its concept IDs, and filter by those concept IDs (or use an OpenAlex-provided related-works mechanism if available).

Suggested change
-     openalex_id: OpenAlex ID
-     max_results: Maximum number of related papers to return
-
- Returns:
-     List of related Paper objects
- """
- if openalex_id.startswith("http"):
-     openalex_id = openalex_id.split("/")[-1]
-
- url = f"{self.BASE_URL}/works"
- params = {
-     "filter": f"has_concepts:{openalex_id}",
-     "per-page": max_results,
-     "sort": "cited_by_count:desc",
-     "mailto": self.EMAIL_PARAM
- }
-
- try:
+     openalex_id: OpenAlex work ID or URL
+     max_results: Maximum number of related papers to return
+
+ Returns:
+     List of related Paper objects
+ """
+ # Normalize to bare OpenAlex work ID (e.g., "W123456789")
+ if openalex_id.startswith("http"):
+     openalex_id = openalex_id.split("/")[-1]
+
+ try:
+     # First, fetch the work to get its associated concepts
+     work_url = f"{self.BASE_URL}/works/{openalex_id}"
+     work_params = {
+         "mailto": self.EMAIL_PARAM
+     }
+     work_response = requests.get(work_url, params=work_params, timeout=30)
+     work_response.raise_for_status()
+     work_data = work_response.json()
+     concepts = work_data.get("concepts") or []
+     concept_ids = [
+         concept.get("id", "").split("/")[-1]
+         for concept in concepts
+         if concept.get("id")
+     ]
+     # If no concepts are associated with this work, we cannot find concept-based related papers
+     if not concept_ids:
+         return []
+     # Build a filter that matches works sharing any of these concepts
+     concept_filter_value = "|".join(concept_ids)
+     url = f"{self.BASE_URL}/works"
+     params = {
+         "filter": f"concepts.id:{concept_filter_value}",
+         "per-page": max_results,
+         "sort": "cited_by_count:desc",
+         "mailto": self.EMAIL_PARAM
+     }
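
On the "OpenAlex-provided related-works mechanism" mentioned in the comment: work records carry a `related_works` list of work URLs, so one option is to read that list from the fetched work instead of matching on concepts. A minimal sketch, assuming the field is present on the record; `related_work_ids` is a hypothetical helper name, not something in the PR:

```python
from typing import List


def related_work_ids(work: dict, max_results: int = 10) -> List[str]:
    """Extract bare work IDs (e.g. 'W123') from a work's `related_works` list."""
    ids: List[str] = []
    for url in work.get("related_works") or []:
        if len(ids) >= max_results:
            break
        if not isinstance(url, str) or not url:
            continue  # skip nulls or unexpected entries
        work_id = url.rstrip("/").split("/")[-1]
        if work_id and work_id not in ids:
            ids.append(work_id)
    return ids
```

The resulting IDs would still need a follow-up request to load their metadata into Paper objects.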

Comment on lines +274 to +278
"filter": f"author.id:{author_id}",
"per-page": max_results,
"sort": "publication_date:desc",
"mailto": self.EMAIL_PARAM
}

Copilot AI Jan 23, 2026

search_by_author sets per-page directly from max_results without clamping to OpenAlex’s max page size (200). Consider per-page = min(max_results, 200) (and pagination if you want more than 200).
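
A sketch of the clamp plus pagination, assuming OpenAlex cursor paging (`cursor=*` on the first request, then `meta.next_cursor` from each response). `get_page` is a hypothetical hook standing in for the actual HTTP call, and `clamp_per_page` and `iter_works` are illustrative names:

```python
from typing import Callable, Dict, Iterator

MAX_PER_PAGE = 200  # page-size cap cited in the review comment


def clamp_per_page(max_results: int) -> int:
    """Keep the requested page size within 1..MAX_PER_PAGE."""
    return max(1, min(max_results, MAX_PER_PAGE))


def iter_works(
    get_page: Callable[[Dict[str, str]], dict],
    base_params: Dict[str, str],
    max_results: int,
) -> Iterator[dict]:
    """Yield up to `max_results` works, following cursors across pages."""
    if max_results <= 0:
        return
    per_page = str(clamp_per_page(max_results))
    cursor = "*"  # initial cursor value for cursor paging
    yielded = 0
    while cursor and yielded < max_results:
        params = dict(base_params)
        params["per-page"] = per_page
        params["cursor"] = cursor
        page = get_page(params)
        results = page.get("results") or []
        if not results:
            break
        for work in results:
            if yielded >= max_results:
                break
            yield work
            yielded += 1
        # A missing or null next_cursor ends the loop.
        cursor = (page.get("meta") or {}).get("next_cursor")
```

If only the clamp is wanted, `clamp_per_page(max_results)` alone covers the case raised in the comment.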

"""
from typing import List, Optional
from datetime import datetime
import time

Copilot AI Jan 23, 2026

Unused imports: time is imported but never used in this module. Removing it avoids linter failures and reduces noise.

Suggested change
- import time

Comment on lines +455 to +459
**kwargs: Additional search parameters:
- filter: OpenAlex filter (e.g., 'has_fulltext:true,type:journal-article')
- sort: Sort field (e.g., 'cited_by_count:desc', 'publication_date:desc')
- fields: Comma-separated list of fields to return

Copilot AI Jan 23, 2026

The docstring advertises a fields kwarg for search_openalex, but the implementation never reads/forwards it. Either add fields passthrough to search_kwargs or remove it from the docstring so callers don’t assume it works.
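
If passthrough is the preferred fix, one possible shape is a small translation step from the advertised kwargs to query parameters. This is a sketch under assumptions: mapping `fields` to OpenAlex's `select` query parameter is a guess at the intended behaviour, and `build_search_kwargs` is a hypothetical name rather than the PR's `search_kwargs` logic:

```python
from typing import Any, Dict

# Docstring-advertised kwarg -> query parameter name. Mapping `fields` to
# `select` is an assumption about how the searcher should build its request.
_KWARG_TO_PARAM = {"filter": "filter", "sort": "sort", "fields": "select"}


def build_search_kwargs(**kwargs: Any) -> Dict[str, str]:
    """Keep only supported kwargs and rename them to query parameters."""
    unknown = set(kwargs) - set(_KWARG_TO_PARAM)
    if unknown:
        raise TypeError(f"unsupported search options: {sorted(unknown)}")

    params: Dict[str, str] = {}
    for name, value in kwargs.items():
        if value is None:
            continue
        if name == "fields":
            # Accept 'id, title' or ['id', 'title']; emit 'id,title'.
            items = value.split(",") if isinstance(value, str) else value
            value = ",".join(str(item).strip() for item in items if str(item).strip())
        value = str(value)
        if value:
            params[_KWARG_TO_PARAM[name]] = value
    return params
```

Rejecting unknown options makes a docstring and implementation mismatch like this one fail loudly instead of being silently ignored.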

Comment on lines +23 to +27
BASE_URL = "https://api.openalex.org"
EMAIL_PARAM = "mailto:paper-search-mcp@example.com"  # Polite identification

def __init__(self, email: Optional[str] = None):
    """Initialize OpenAlex searcher.

Copilot AI Jan 23, 2026

EMAIL_PARAM includes a mailto: prefix but is then passed as the value of the mailto query parameter (resulting in mailto=mailto:...). Store just the email address (no prefix) and pass that as params['mailto'] (see how CrossRef does it in paper_search_mcp/academic_platforms/crossref.py:67).
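
A defensive variant is to normalise whatever the caller passes, so a stray `mailto:` prefix never reaches the query string. A minimal sketch; `DEFAULT_EMAIL`, `normalize_email`, and `with_mailto` are hypothetical names, and the default address is the placeholder from the PR:

```python
from typing import Dict, Optional

DEFAULT_EMAIL = "paper-search-mcp@example.com"  # placeholder address from the PR


def normalize_email(email: Optional[str]) -> str:
    """Return a bare address, dropping any leading 'mailto:' scheme."""
    value = (email or DEFAULT_EMAIL).strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    return value.strip() or DEFAULT_EMAIL


def with_mailto(params: Dict[str, str], email: Optional[str] = None) -> Dict[str, str]:
    """Copy `params` and add the `mailto` query parameter for polite identification."""
    merged = dict(params)
    merged["mailto"] = normalize_email(email)
    return merged
```

With this in place the request carries `mailto=paper-search-mcp@example.com` rather than `mailto=mailto:...`.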

if pub_date:
    try:
        published_date = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
    except:

Copilot AI Jan 23, 2026

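The bare except: clause catches BaseException, which includes KeyboardInterrupt and SystemExit. Catch only the exceptions that date parsing can raise.

Suggested change
- except:
+ except (ValueError, TypeError):
+     # If the publication date is malformed or of an unexpected type,
+     # leave published_date as None and continue.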
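
Since the same parsing appears more than once in the module, it could be factored into one helper with the narrow exception handling in a single place. A minimal sketch; `parse_publication_date` is a hypothetical name, not a function in the PR:

```python
from datetime import datetime
from typing import Optional


def parse_publication_date(pub_date: object) -> Optional[datetime]:
    """Parse an ISO-8601 date string; return None for missing or malformed input."""
    if not isinstance(pub_date, str) or not pub_date:
        return None
    try:
        # fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
        # so rewrite it as an explicit UTC offset first.
        return datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
    except ValueError:
        return None
```

The type check up front replaces the TypeError branch, so only ValueError needs catching.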

    # Sort by position and join
    sorted_indices = sorted(index_to_word.keys())
    abstract_text = " ".join([index_to_word[i] for i in sorted_indices])
except:

Copilot AI Jan 23, 2026

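The bare except: clause catches BaseException, which includes KeyboardInterrupt and SystemExit. Catching Exception keeps the fallback for a malformed abstract index without swallowing interpreter-level exits.

Suggested change
- except:
+ except Exception: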
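
The reconstruction can also be written so that malformed input is skipped explicitly, which removes the need for a broad except around it. A minimal sketch, assuming the input is OpenAlex's inverted abstract index (a mapping from each word to the list of positions where it occurs); `rebuild_abstract` is a hypothetical name:

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not isinstance(inverted_index, dict):
        return ""
    index_to_word: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        if not isinstance(positions, list):
            continue  # skip malformed entries instead of raising
        for pos in positions:
            if isinstance(pos, int):
                index_to_word[pos] = word
    # Sort by position and join, as in the original code.
    return " ".join(index_to_word[i] for i in sorted(index_to_word))
```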

Comment on lines +369 to +370
except:
    pass

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except:
-     pass
+ except (ValueError, TypeError):
+     # If the publication date is not a valid ISO format, treat it as missing.
+     published_date = None
