feat(hal): add HAL integration for French open archive #38

hongkongkiwi wants to merge 6 commits into openags:main
Conversation
…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works. This integration provides:

- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:

- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
…holar

This enhancement adds comprehensive citation and reference functionality.

Semantic Scholar enhancements:

- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:

- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis
Sci-Hub provides access to millions of research papers behind paywalls. This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:

- download_scihub: Download PDFs using DOI, PMID, or URL

Note:

- Sci-Hub operates in a legal gray area
- Only use it for legitimate research purposes
- Ensure compliance with local laws and institutional policies
The same papers often appear in multiple sources (arXiv, Semantic Scholar, etc.). This feature adds intelligent deduplication based on:

- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching

Features:

- deduplicate_papers: Remove duplicates from a paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing them

Use cases:

- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
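The matching rules above can be sketched as a small predicate. The 90% title-similarity threshold comes from the commit message; the function names and the use of difflib here are our illustration, not the PR's actual implementation.

```python
# Sketch of the two strongest matching rules (DOI, then fuzzy title).
# Hypothetical helpers, not the PR's deduplication.py code.
from difflib import SequenceMatcher

def titles_match(title1: str, title2: str, threshold: float = 0.9) -> bool:
    """True when normalized titles are at least `threshold` similar."""
    t1, t2 = title1.lower().strip(), title2.lower().strip()
    return SequenceMatcher(None, t1, t2).ratio() >= threshold

def is_duplicate(doi1, doi2, title1, title2) -> bool:
    # Rule 1: DOI match is the most reliable signal; when both papers
    # carry a DOI, it decides the question outright.
    if doi1 and doi2:
        return doi1.lower() == doi2.lower()
    # Rule 2: otherwise fall back to fuzzy title comparison.
    return titles_match(title1, title2)
```

A real implementation would add the author + year fallback and normalize punctuation in titles before comparing.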
…pers

PMC provides free access to full-text biomedical and life sciences literature. This integration enables searching and downloading complete articles.

Features:

- search_pmc: Search PMC using the E-utilities API
- get_pmc_paper: Get paper by PMCID
- download_pmc: Download PDF of PMC articles
- read_pmc_paper: Extract text from PMC PDFs

PMC differs from PubMed (abstracts only): PMC hosts complete full-text articles from biomedical and life sciences journals at NIH/NLM.
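For context on the E-utilities call a search_pmc tool would make, here is a minimal parameter-building sketch. The esearch endpoint and the `db`, `term`, `retmax`, and `retmode` parameters are the real NCBI E-utilities interface; the helper name is our assumption.

```python
# Hypothetical helper for the esearch step against NCBI E-utilities
# (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pmc_search(term: str, max_results: int = 20) -> dict:
    """Parameters for an E-utilities esearch call against the PMC database."""
    return {
        "db": "pmc",          # search PubMed Central, not PubMed
        "term": term,
        "retmax": max_results,
        "retmode": "json",    # JSON is easier to parse than the default XML
    }
```

The IDs returned by esearch would then be passed to efetch to retrieve full records.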
HAL is a multi-disciplinary open archive for scientific documents from French institutions, including theses, preprints, and articles.

Features:

- Search by query, author, or domain
- Filter by document type (thesis, preprint, article)
- Browse by collection (CNRS, INRIA, universities)
- Get author profiles and download files
Pull request overview
This PR claims to add HAL (French open archive) integration but actually adds multiple academic platform integrations including OpenAlex, PubMed Central (PMC), Sci-Hub, and deduplication utilities. The HAL searcher implementation is created but never integrated into the server, making the PR title and description misleading.
Changes:
- Added OpenAlexSearcher with comprehensive search, citation, and related paper functionality
- Added PMCSearcher for PubMed Central biomedical full-text papers
- Added SciHubFetcher for downloading PDFs from Sci-Hub
- Added deduplication utilities to remove duplicate papers from multiple sources
- Added HALSearcher (but not integrated into server.py)
- Enhanced SemanticSearcher with citation, reference, related paper, and author search methods
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 28 comments.
| File | Description |
|---|---|
| paper_search_mcp/server.py | Added tools for OpenAlex, PMC, Sci-Hub, and deduplication; enhanced Semantic Scholar tools; contains multiple bugs with unused httpx clients |
| paper_search_mcp/academic_platforms/hal.py | Complete HAL searcher implementation (NOT integrated into server) with syntax error in collections list |
| paper_search_mcp/academic_platforms/openalex.py | OpenAlex API integration with bugs in year range filtering and related papers query |
| paper_search_mcp/academic_platforms/pmc.py | PubMed Central integration with bugs in year range filtering and XML parsing |
| paper_search_mcp/academic_platforms/semantic.py | Added citation, reference, related paper, and author search methods |
| paper_search_mcp/deduplication.py | Paper deduplication utilities with potential datetime comparison bug |
```python
async with httpx.AsyncClient() as client:
    papers = semantic_searcher.search_by_author(author_name, max_results)
    return [paper.to_dict() for paper in papers] if papers else []
```
An async httpx.AsyncClient is created but never used. The underlying semantic_searcher methods are synchronous and don't need this client. Remove this unused context manager.
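Assuming the tool wrapper looks like the snippet above, the fix is simply to call the synchronous searcher directly. The sketch below illustrates the corrected shape with a stub standing in for the real SemanticSearcher; it is not the PR's actual server.py code.

```python
import asyncio

# Stub in place of the real semantic_searcher from
# paper_search_mcp/academic_platforms/semantic.py, so the sketch is runnable.
class StubSearcher:
    def search_by_author(self, author_name, max_results):
        return []

semantic_searcher = StubSearcher()

async def search_semantic_by_author(author_name: str, max_results: int = 10) -> list:
    # No httpx.AsyncClient needed: the searcher is synchronous.
    papers = semantic_searcher.search_by_author(author_name, max_results)
    return [paper.to_dict() for paper in papers] if papers else []

result = asyncio.run(search_semantic_by_author("Ada Lovelace"))
# result == []
```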
```python
async with httpx.AsyncClient() as client:
    paper = openalex_searcher.get_paper_by_id(paper_id)
    return paper.to_dict() if paper else {}
```
An async httpx.AsyncClient is created but never used in this function. The underlying openalex_searcher.get_paper_by_id method is synchronous. Remove this unused context manager.
```python
async with httpx.AsyncClient() as client:
    paper = pmc_searcher.get_paper_by_pmcid(paper_id)
    return paper.to_dict() if paper else {}
```
An async httpx.AsyncClient is created but never used. The underlying pmc_searcher.get_paper_by_pmcid method is synchronous. Remove this unused context manager.
Suggested change:

```diff
-async with httpx.AsyncClient() as client:
-    paper = pmc_searcher.get_paper_by_pmcid(paper_id)
-    return paper.to_dict() if paper else {}
+paper = pmc_searcher.get_paper_by_pmcid(paper_id)
+return paper.to_dict() if paper else {}
```
```python
if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
    return True

# Check author + year match
if paper1.authors and paper2.authors:
    authors1 = [a.lower().strip() for a in paper1.authors]
    authors2 = [a.lower().strip() for a in paper2.authors]
    # Check if at least 2 authors match
    matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
    if matching_authors >= 2:
        # Check publication year
        year1 = paper1.published_date.year if paper1.published_date else None
        year2 = paper2.published_date.year if paper2.published_date else None
        if year1 and year2 and year1 == year2:
            return True
```
The author matching logic has O(n*m) complexity where it checks all author pairs. For papers with many authors, this could be inefficient. Consider converting authors to a set and using set operations for faster matching, or at least short-circuit after finding the first match.
Suggested change:

```diff
-if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
-    return True
-# Check author + year match
-if paper1.authors and paper2.authors:
-    authors1 = [a.lower().strip() for a in paper1.authors]
-    authors2 = [a.lower().strip() for a in paper2.authors]
-    # Check if at least 2 authors match
-    matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
-    if matching_authors >= 2:
-        # Check publication year
-        year1 = paper1.published_date.year if paper1.published_date else None
-        year2 = paper2.published_date.year if paper2.published_date else None
-        if year1 and year2 and year1 == year2:
-            return True
+for a1 in authors1:
+    for a2 in authors2:
+        if a1 in a2 or a2 in a1:
+            return True
+# Check author + year match
+if paper1.authors and paper2.authors:
+    # Check publication year first; if years don't match, author matches can't make them the same paper
+    year1 = paper1.published_date.year if paper1.published_date else None
+    year2 = paper2.published_date.year if paper2.published_date else None
+    if not (year1 and year2 and year1 == year2):
+        return False
+    authors1 = [a.lower().strip() for a in paper1.authors]
+    authors2 = [a.lower().strip() for a in paper2.authors]
+    # Check if at least 2 authors match, short-circuiting once the threshold is reached
+    matching_authors = 0
+    for a1 in authors1:
+        for a2 in authors2:
+            if a1 in a2 or a2 in a1:
+                matching_authors += 1
+                if matching_authors >= 2:
+                    return True
```
```python
response = self.session.get(fetch_url, params=params, timeout=30)
response.raise_for_status()

return response.text.decode('utf-8')
```
response.text is already a string; calling .decode('utf-8') on it will fail with AttributeError, since decode is only defined on bytes. Either use response.text directly or response.content.decode('utf-8').
Suggested change:

```diff
-return response.text.decode('utf-8')
+return response.text
```
```python
    # Sort by position and join
    sorted_indices = sorted(index_to_word.keys())
    abstract_text = " ".join([index_to_word[i] for i in sorted_indices])
except:
```
A bare except clause catches BaseException directly, including KeyboardInterrupt and SystemExit; catch a narrower exception type instead.
Suggested change:

```diff
-except:
+except Exception:
```
```python
    month_int = int(month) if month and month.isdigit() else 1
    day_int = int(day) if day and day.isdigit() else 1
    published_date = datetime(int(year), month_int, day_int)
except:
```
A bare except clause catches BaseException directly, including KeyboardInterrupt and SystemExit; catch the specific exceptions the date parsing can raise instead.
Suggested change:

```diff
-except:
+except (ValueError, TypeError):
```
```python
if d.get("published_date"):
    try:
        published_date = datetime.fromisoformat(d["published_date"])
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
```python
if d.get("updated_date"):
    try:
        updated_date = datetime.fromisoformat(d["updated_date"])
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
```python
if pub_date:
    try:
        published_date = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
Suggested change:

```diff
-except:
+except (ValueError, TypeError):
+    # Ignore invalid or unexpected publication date formats from the API
```