feat(openalex): add OpenAlex API integration #30
hongkongkiwi wants to merge 1 commit into openags:main
…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works. This integration provides:
- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
Pull request overview
Adds OpenAlex as an additional academic source, exposing MCP tools for searching works and traversing citation/references/related-work graphs, plus optional PDF download/read support via open-access locations.
Changes:
- Introduces an `OpenAlexSearcher` implementation backed by the OpenAlex REST API.
- Wires new OpenAlex MCP tools into `paper_search_mcp/server.py` (search, lookup, citations, references, author search, related, download, read).
- Adds OpenAlex-specific parsing for works into the project’s standardized `Paper` format.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| `paper_search_mcp/server.py` | Registers OpenAlex searcher instance and exposes OpenAlex MCP tools. |
| `paper_search_mcp/academic_platforms/openalex.py` | Implements OpenAlex API calls, result parsing, and PDF download/read helpers. |

    # Year range
    params["filter"] = f"from_publication_date:{year}"

Year-range handling sets filter = from_publication_date:{year} when year is in the YYYY-YYYY form (per docstring/examples). from_publication_date is a date field, so this produces an invalid/ambiguous filter. Convert ranges to proper date filters (e.g., from_publication_date:YYYY-01-01,to_publication_date:YYYY-12-31) or use publication_year constraints.

Suggested change:

    - # Year range
    - params["filter"] = f"from_publication_date:{year}"
    + # Year range, expected format: 'YYYY-YYYY'
    + start_year, end_year = [part.strip() for part in year.split("-", 1)]
    + if (
    +     len(start_year) == 4
    +     and len(end_year) == 4
    +     and start_year.isdigit()
    +     and end_year.isdigit()
    + ):
    +     params["filter"] = (
    +         f"from_publication_date:{start_year}-01-01,"
    +         f"to_publication_date:{end_year}-12-31"
    +     )
    + else:
    +     # Fallback: treat as a single publication_year filter
    +     params["filter"] = f"publication_year:{year}"


    params = {
        "filter": f"referenced_by:{openalex_id}",
        "per-page": max_results,
        "mailto": self.EMAIL_PARAM
    }

get_references uses filter=referenced_by:{openalex_id}, which reads as “works that reference this work” (forward citations), not “works referenced by this work”. To return actual references, fetch the work first, read its referenced_works IDs, then bulk-fetch those works’ metadata. Also consider clamping per-page to 200 to match OpenAlex’s max page size.
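
The two-step approach described in this comment could look roughly like the sketch below. It leaves out the network calls so the logic is easy to check. The helper names (`bare_id`, `reference_filters`), the `openalex_id` filter key, and the batch size of 50 are assumptions for illustration, not code from this PR; the filter name and the limit on OR-ed values should be confirmed against the OpenAlex documentation.

```python
from typing import Dict, List


def bare_id(openalex_url_or_id: str) -> str:
    """Reduce 'https://openalex.org/W123' (or 'W123') to 'W123'."""
    return openalex_url_or_id.rstrip("/").split("/")[-1]


def reference_filters(work: Dict, batch_size: int = 50) -> List[str]:
    """Build one filter string per batch of referenced work IDs.

    `work` is the JSON object returned for a single work; its
    `referenced_works` field lists the works it cites.
    The 'openalex_id' filter key is an assumption (hypothetical here).
    """
    ids = [bare_id(ref) for ref in work.get("referenced_works") or []]
    return [
        "openalex_id:" + "|".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


# Hand-written stand-in for an API response, not real data.
sample_work = {
    "id": "https://openalex.org/W1",
    "referenced_works": [
        "https://openalex.org/W2",
        "https://openalex.org/W3",
        "https://openalex.org/W4",
    ],
}
filters = reference_filters(sample_work, batch_size=2)
print(filters)
```

Each returned string would then be sent as the `filter` parameter of a separate `/works` request, and the results concatenated and truncated to `max_results`.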

            openalex_id: OpenAlex ID
            max_results: Maximum number of related papers to return

        Returns:
            List of related Paper objects
        """
        if openalex_id.startswith("http"):
            openalex_id = openalex_id.split("/")[-1]

        url = f"{self.BASE_URL}/works"
        params = {
            "filter": f"has_concepts:{openalex_id}",
            "per-page": max_results,
            "sort": "cited_by_count:desc",
            "mailto": self.EMAIL_PARAM
        }

        try:

get_related_papers uses filter=has_concepts:{openalex_id}, but openalex_id here is a work ID (W...), not a concept ID. This is likely to return incorrect/empty results. Fetch the work, extract its concept IDs, and filter by those concept IDs (or use an OpenAlex-provided related-works mechanism if available).

Suggested change:

    -     openalex_id: OpenAlex ID
    -     max_results: Maximum number of related papers to return
    - Returns:
    -     List of related Paper objects
    - """
    - if openalex_id.startswith("http"):
    -     openalex_id = openalex_id.split("/")[-1]
    - url = f"{self.BASE_URL}/works"
    - params = {
    -     "filter": f"has_concepts:{openalex_id}",
    -     "per-page": max_results,
    -     "sort": "cited_by_count:desc",
    -     "mailto": self.EMAIL_PARAM
    - }
    - try:
    +     openalex_id: OpenAlex work ID or URL
    +     max_results: Maximum number of related papers to return
    + Returns:
    +     List of related Paper objects
    + """
    + # Normalize to bare OpenAlex work ID (e.g., "W123456789")
    + if openalex_id.startswith("http"):
    +     openalex_id = openalex_id.split("/")[-1]
    + try:
    +     # First, fetch the work to get its associated concepts
    +     work_url = f"{self.BASE_URL}/works/{openalex_id}"
    +     work_params = {
    +         "mailto": self.EMAIL_PARAM
    +     }
    +     work_response = requests.get(work_url, params=work_params, timeout=30)
    +     work_response.raise_for_status()
    +     work_data = work_response.json()
    +     concepts = work_data.get("concepts") or []
    +     concept_ids = [
    +         concept.get("id", "").split("/")[-1]
    +         for concept in concepts
    +         if concept.get("id")
    +     ]
    +     # If no concepts are associated with this work, we cannot find concept-based related papers
    +     if not concept_ids:
    +         return []
    +     # Build a filter that matches works sharing any of these concepts
    +     concept_filter_value = "|".join(concept_ids)
    +     url = f"{self.BASE_URL}/works"
    +     params = {
    +         "filter": f"concepts.id:{concept_filter_value}",
    +         "per-page": max_results,
    +         "sort": "cited_by_count:desc",
    +         "mailto": self.EMAIL_PARAM
    +     }


        "filter": f"author.id:{author_id}",
        "per-page": max_results,
        "sort": "publication_date:desc",
        "mailto": self.EMAIL_PARAM
    }

search_by_author sets per-page directly from max_results without clamping to OpenAlex’s max page size (200). Consider per-page = min(max_results, 200) (and pagination if you want more than 200).
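
As an illustration of the clamping and pagination mentioned here, the sketch below computes which pages to request for a given `max_results`. The names `clamp_per_page` and `page_plan` are hypothetical, and the limit of 200 is taken from this comment rather than verified independently. The page size is held constant across requests so that page numbers stay aligned; the caller is expected to truncate the combined results to `max_results`.

```python
import math
from typing import Dict, List

MAX_PER_PAGE = 200  # page-size ceiling cited in the review comment


def clamp_per_page(max_results: int) -> int:
    """Keep the requested page size within 1..MAX_PER_PAGE."""
    return max(1, min(max_results, MAX_PER_PAGE))


def page_plan(max_results: int) -> List[Dict[str, int]]:
    """Return the paging parameters for each request that is needed."""
    if max_results <= 0:
        return []
    per_page = clamp_per_page(max_results)
    pages = math.ceil(max_results / per_page)
    return [{"page": page, "per-page": per_page} for page in range(1, pages + 1)]


plan = page_plan(450)
print(plan)
```

For `max_results=450` this yields three requests of 200 results each, of which the last 150 would be discarded by the caller.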

    """
    from typing import List, Optional
    from datetime import datetime
    import time

Unused imports: time is imported but never used in this module. Removing it avoids linter failures and reduces noise.

Suggested change:

    - import time


    **kwargs: Additional search parameters:
        - filter: OpenAlex filter (e.g., 'has_fulltext:true,type:journal-article')
        - sort: Sort field (e.g., 'cited_by_count:desc', 'publication_date:desc')
        - fields: Comma-separated list of fields to return

The docstring advertises a fields kwarg for search_openalex, but the implementation never reads/forwards it. Either add fields passthrough to search_kwargs or remove it from the docstring so callers don’t assume it works.
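
If the passthrough option is chosen, one possible shape is sketched below. The function `build_search_params` is hypothetical and stands in for whatever the PR uses to assemble request parameters. Mapping the `fields` kwarg onto OpenAlex's `select` query parameter is an assumption about the author's intent, not something the PR confirms.

```python
from typing import Any, Dict


def build_search_params(query: str, max_results: int = 10, **kwargs: Any) -> Dict[str, Any]:
    """Assemble query parameters, forwarding only the documented kwargs."""
    params: Dict[str, Any] = {"search": query, "per-page": max_results}

    # Forward filter and sort unchanged when the caller supplies them.
    for key in ("filter", "sort"):
        if kwargs.get(key):
            params[key] = kwargs[key]

    # Assumption: 'fields' is sent as the 'select' parameter.
    fields = kwargs.get("fields")
    if fields:
        if not isinstance(fields, str):
            fields = ",".join(fields)
        cleaned = [part.strip() for part in fields.split(",") if part.strip()]
        if cleaned:
            params["select"] = ",".join(cleaned)

    return params


params = build_search_params(
    "graph neural networks",
    5,
    fields="id, doi ,title",
    sort="cited_by_count:desc",
)
print(params)
```

Restricting the returned fields also affects parsing: any code that builds a `Paper` from the response would need to tolerate the omitted keys.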

    BASE_URL = "https://api.openalex.org"
    EMAIL_PARAM = "mailto:paper-search-mcp@example.com"  # Polite identification

    def __init__(self, email: Optional[str] = None):
        """Initialize OpenAlex searcher.

EMAIL_PARAM includes a mailto: prefix but is then passed as the value of the mailto query parameter (resulting in mailto=mailto:...). Store just the email address (no prefix) and pass that as params['mailto'] (see how CrossRef does it in paper_search_mcp/academic_platforms/crossref.py:67).
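
A small sketch of the fix, assuming the constructor keeps its optional `email` argument: store a bare address and strip any `mailto:` prefix a caller might pass. The names `DEFAULT_EMAIL` and `normalize_email` are hypothetical; the default address is the placeholder already used in the PR.

```python
from typing import Optional

DEFAULT_EMAIL = "paper-search-mcp@example.com"  # bare address, no scheme prefix


def normalize_email(email: Optional[str]) -> str:
    """Return a bare email address suitable for the 'mailto' query parameter."""
    if not email:
        return DEFAULT_EMAIL
    email = email.strip()
    # Drop a leading 'mailto:' so the request does not carry 'mailto=mailto:...'.
    if email.lower().startswith("mailto:"):
        email = email[len("mailto:"):]
    return email or DEFAULT_EMAIL


print(normalize_email("mailto:someone@example.org"))
```

The result can be passed directly as `params["mailto"]`.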

    if pub_date:
        try:
            published_date = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
        except:

Except block directly handles BaseException.

Suggested change:

    - except:
    + except (ValueError, TypeError):
    +     # If the publication date is malformed or of an unexpected type,
    +     # leave published_date as None and continue.


        # Sort by position and join
        sorted_indices = sorted(index_to_word.keys())
        abstract_text = " ".join([index_to_word[i] for i in sorted_indices])
    except:

Except block directly handles BaseException.

Suggested change:

    - except:
    + except Exception:

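
For context, the code under review rebuilds abstract text from an inverted index, which maps each word to the list of positions where it occurs. The sketch below shows that reconstruction as a standalone function. The name `abstract_from_inverted_index` is hypothetical, and the function handles a missing index explicitly instead of relying on a broad `except`; malformed input (for example, non-list values) would still raise.

```python
from typing import Dict, List, Optional


def abstract_from_inverted_index(inverted: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    if not inverted:
        return ""

    # Invert the mapping: position -> word.
    index_to_word: Dict[int, str] = {}
    for word, positions in inverted.items():
        for position in positions:
            index_to_word[position] = word

    # Sort by position and join, as in the reviewed snippet.
    return " ".join(index_to_word[i] for i in sorted(index_to_word))


# Hand-written sample, not real API output.
sample = {"Open": [0], "catalog": [1, 4], "of": [2], "a": [3]}
text = abstract_from_inverted_index(sample)
print(text)
```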

    except:
        pass

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

    - except:
    -     pass
    + except (ValueError, TypeError):
    +     # If the publication date is not a valid ISO format, treat it as missing.
    +     published_date = None

OpenAlex integration with 200M+ works, citations, author search. Adds search_openalex, get_openalex_paper, get_openalex_citations, get_openalex_references, search_openalex_by_author, get_openalex_related, download_openalex, read_openalex_paper tools.