feat(hal): add HAL integration for French open archive #38

hongkongkiwi wants to merge 6 commits into openags:main
Conversation
…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works. This integration provides:

- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:

- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
…holar

This enhancement adds comprehensive citation and reference functionality.

Semantic Scholar enhancements:

- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:

- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis
Sci-Hub provides access to millions of research papers behind paywalls. This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:

- download_scihub: Download PDFs using DOI, PMID, or URL

Note:

- Sci-Hub operates in a legal gray area
- Only use it for legitimate research purposes
- Ensure compliance with local laws and institutional policies
The same papers often appear in multiple sources (arXiv, Semantic Scholar, etc.). This feature adds intelligent deduplication based on:

- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching

Features:

- deduplicate_papers: Remove duplicates from a paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing them

Use cases:

- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
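The matching rules above can be sketched as a small predicate. The 90% title-similarity threshold comes from the commit message; the function names and the use of difflib here are our illustration, not the PR's actual implementation.

```python
# Sketch of the two strongest matching rules (DOI, then fuzzy title).
# Hypothetical helpers, not the PR's deduplication.py code.
from difflib import SequenceMatcher

def titles_match(title1: str, title2: str, threshold: float = 0.9) -> bool:
    """True when normalized titles are at least `threshold` similar."""
    t1, t2 = title1.lower().strip(), title2.lower().strip()
    return SequenceMatcher(None, t1, t2).ratio() >= threshold

def is_duplicate(doi1, doi2, title1, title2) -> bool:
    # Rule 1: DOI match is the most reliable signal; when both papers
    # carry a DOI, it decides the question outright.
    if doi1 and doi2:
        return doi1.lower() == doi2.lower()
    # Rule 2: otherwise fall back to fuzzy title comparison.
    return titles_match(title1, title2)
```

A real implementation would add the author + year fallback and normalize punctuation in titles before comparing.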
…pers

PMC provides free access to full-text biomedical and life sciences literature. This integration enables searching and downloading complete articles.

Features:

- search_pmc: Search PMC using the E-utilities API
- get_pmc_paper: Get paper by PMCID
- download_pmc: Download PDF of PMC articles
- read_pmc_paper: Extract text from PMC PDFs

PMC differs from PubMed (abstracts only): PMC hosts complete full-text articles from biomedical and life sciences journals at NIH/NLM.
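For context on the E-utilities call a search_pmc tool would make, here is a minimal parameter-building sketch. The esearch endpoint and the `db`, `term`, `retmax`, and `retmode` parameters are the real NCBI E-utilities interface; the helper name is our assumption.

```python
# Hypothetical helper for the esearch step against NCBI E-utilities
# (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pmc_search(term: str, max_results: int = 20) -> dict:
    """Parameters for an E-utilities esearch call against the PMC database."""
    return {
        "db": "pmc",          # search PubMed Central, not PubMed
        "term": term,
        "retmax": max_results,
        "retmode": "json",    # JSON is easier to parse than the default XML
    }
```

The IDs returned by esearch would then be passed to efetch to retrieve full records.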
HAL is a multi-disciplinary open archive for scientific documents from French institutions, including theses, preprints, and articles.

Features:

- Search by query, author, or domain
- Filter by document type (thesis, preprint, article)
- Browse by collection (CNRS, INRIA, universities)
- Get author profiles and download files
Pull request overview
This PR claims to add HAL (French open archive) integration but actually adds multiple academic platform integrations including OpenAlex, PubMed Central (PMC), Sci-Hub, and deduplication utilities. The HAL searcher implementation is created but never integrated into the server, making the PR title and description misleading.
Changes:
- Added OpenAlexSearcher with comprehensive search, citation, and related paper functionality
- Added PMCSearcher for PubMed Central biomedical full-text papers
- Added SciHubFetcher for downloading PDFs from Sci-Hub
- Added deduplication utilities to remove duplicate papers from multiple sources
- Added HALSearcher (but not integrated into server.py)
- Enhanced SemanticSearcher with citation, reference, related paper, and author search methods
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 28 comments.
| File | Description |
|---|---|
| paper_search_mcp/server.py | Added tools for OpenAlex, PMC, Sci-Hub, and deduplication; enhanced Semantic Scholar tools; contains multiple bugs with unused httpx clients |
| paper_search_mcp/academic_platforms/hal.py | Complete HAL searcher implementation (NOT integrated into server) with syntax error in collections list |
| paper_search_mcp/academic_platforms/openalex.py | OpenAlex API integration with bugs in year range filtering and related papers query |
| paper_search_mcp/academic_platforms/pmc.py | PubMed Central integration with bugs in year range filtering and XML parsing |
| paper_search_mcp/academic_platforms/semantic.py | Added citation, reference, related paper, and author search methods |
| paper_search_mcp/deduplication.py | Paper deduplication utilities with potential datetime comparison bug |
```python
async with httpx.AsyncClient() as client:
    papers = semantic_searcher.search_by_author(author_name, max_results)
    return [paper.to_dict() for paper in papers] if papers else []
```
An async httpx.AsyncClient is created but never used. The underlying semantic_searcher methods are synchronous and don't need this client. Remove this unused context manager.
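Assuming the tool wrapper looks like the snippet above, the fix is simply to call the synchronous searcher directly. The sketch below illustrates the corrected shape with a stub standing in for the real SemanticSearcher; it is not the PR's actual server.py code.

```python
import asyncio

# Stub in place of the real semantic_searcher from
# paper_search_mcp/academic_platforms/semantic.py, so the sketch is runnable.
class StubSearcher:
    def search_by_author(self, author_name, max_results):
        return []

semantic_searcher = StubSearcher()

async def search_semantic_by_author(author_name: str, max_results: int = 10) -> list:
    # No httpx.AsyncClient needed: the searcher is synchronous.
    papers = semantic_searcher.search_by_author(author_name, max_results)
    return [paper.to_dict() for paper in papers] if papers else []

result = asyncio.run(search_semantic_by_author("Ada Lovelace"))
# result == []
```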
```python
async with httpx.AsyncClient() as client:
    paper = openalex_searcher.get_paper_by_id(paper_id)
    return paper.to_dict() if paper else {}
```
An async httpx.AsyncClient is created but never used in this function. The underlying openalex_searcher.get_paper_by_id method is synchronous. Remove this unused context manager.
```python
async with httpx.AsyncClient() as client:
    paper = pmc_searcher.get_paper_by_pmcid(paper_id)
    return paper.to_dict() if paper else {}
```
An async httpx.AsyncClient is created but never used. The underlying pmc_searcher.get_paper_by_pmcid method is synchronous. Remove this unused context manager.
Suggested change:

```diff
-async with httpx.AsyncClient() as client:
-    paper = pmc_searcher.get_paper_by_pmcid(paper_id)
-    return paper.to_dict() if paper else {}
+paper = pmc_searcher.get_paper_by_pmcid(paper_id)
+return paper.to_dict() if paper else {}
```
```python
if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
    return True

# Check author + year match
if paper1.authors and paper2.authors:
    authors1 = [a.lower().strip() for a in paper1.authors]
    authors2 = [a.lower().strip() for a in paper2.authors]
    # Check if at least 2 authors match
    matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
    if matching_authors >= 2:
        # Check publication year
        year1 = paper1.published_date.year if paper1.published_date else None
        year2 = paper2.published_date.year if paper2.published_date else None
        if year1 and year2 and year1 == year2:
            return True
```
The author matching logic has O(n*m) complexity where it checks all author pairs. For papers with many authors, this could be inefficient. Consider converting authors to a set and using set operations for faster matching, or at least short-circuit after finding the first match.
Suggested change:

```diff
-if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
-    return True
-# Check author + year match
-if paper1.authors and paper2.authors:
-    authors1 = [a.lower().strip() for a in paper1.authors]
-    authors2 = [a.lower().strip() for a in paper2.authors]
-    # Check if at least 2 authors match
-    matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
-    if matching_authors >= 2:
-        # Check publication year
-        year1 = paper1.published_date.year if paper1.published_date else None
-        year2 = paper2.published_date.year if paper2.published_date else None
-        if year1 and year2 and year1 == year2:
-            return True
+for a1 in authors1:
+    for a2 in authors2:
+        if a1 in a2 or a2 in a1:
+            return True
+# Check author + year match
+if paper1.authors and paper2.authors:
+    # Check publication year first; if years don't match, author matches can't make them the same paper
+    year1 = paper1.published_date.year if paper1.published_date else None
+    year2 = paper2.published_date.year if paper2.published_date else None
+    if not (year1 and year2 and year1 == year2):
+        return False
+    authors1 = [a.lower().strip() for a in paper1.authors]
+    authors2 = [a.lower().strip() for a in paper2.authors]
+    # Check if at least 2 authors match, short-circuiting once the threshold is reached
+    matching_authors = 0
+    for a1 in authors1:
+        for a2 in authors2:
+            if a1 in a2 or a2 in a1:
+                matching_authors += 1
+                if matching_authors >= 2:
+                    return True
```
```python
response = self.session.get(fetch_url, params=params, timeout=30)
response.raise_for_status()

return response.text.decode('utf-8')
```
response.text is already a string; calling .decode('utf-8') on it will fail with AttributeError, since decode is only defined on bytes. Either use response.text directly or response.content.decode('utf-8').
Suggested change:

```diff
-return response.text.decode('utf-8')
+return response.text
```
```python
    # Sort by position and join
    sorted_indices = sorted(index_to_word.keys())
    abstract_text = " ".join([index_to_word[i] for i in sorted_indices])
except:
```
A bare except clause catches BaseException directly, including KeyboardInterrupt and SystemExit; catch a narrower exception type instead.
Suggested change:

```diff
-except:
+except Exception:
```
```python
    month_int = int(month) if month and month.isdigit() else 1
    day_int = int(day) if day and day.isdigit() else 1
    published_date = datetime(int(year), month_int, day_int)
except:
```
A bare except clause catches BaseException directly, including KeyboardInterrupt and SystemExit; catch the specific exceptions the date parsing can raise instead.
Suggested change:

```diff
-except:
+except (ValueError, TypeError):
```
```python
if d.get("published_date"):
    try:
        published_date = datetime.fromisoformat(d["published_date"])
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
```python
if d.get("updated_date"):
    try:
        updated_date = datetime.fromisoformat(d["updated_date"])
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
```python
if pub_date:
    try:
        published_date = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
    except:
```
The 'except' clause does nothing but pass, and there is no explanatory comment.
Suggested change:

```diff
-except:
+except (ValueError, TypeError):
+    # Ignore invalid or unexpected publication date formats from the API
```