feat(hal): add HAL integration for French open archive #38

Open
hongkongkiwi wants to merge 6 commits into openags:main from hongkongkiwi:feature/hal-integration

@hongkongkiwi

HAL is a multi-disciplinary open archive for scientific documents from French institutions including theses, preprints, and articles.

Features

  • Search by query, author, domain
  • Filter by document type (thesis, preprint, article)
  • Browse by collection (CNRS, INRIA, universities)
  • Get author profiles and download files

…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works.
This integration provides:

- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
…holar

This enhancement adds comprehensive citation and reference functionality:

Semantic Scholar enhancements:
- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:
- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis

Sci-Hub provides access to millions of research papers behind paywalls.
This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:
- download_scihub: Download PDFs using DOI, PMID, or URL

Note:
- Sci-Hub operates in a legal gray area
- Only use for legitimate research purposes
- Ensure compliance with local laws and institutional policies

The same papers often appear in multiple sources (arXiv, Semantic Scholar, etc.).
This feature adds intelligent deduplication based on:
- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching
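The three criteria above can be sketched roughly as follows. This is a minimal illustration, not the code in deduplication.py: `PaperStub`, `normalize`, and `is_duplicate` are hypothetical names, `difflib.SequenceMatcher` is one possible way to score title similarity, and the "at least two shared authors" rule is an assumption taken from the review discussion of this PR.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class PaperStub:
    """Hypothetical stand-in for the project's paper model."""
    title: str
    doi: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def is_duplicate(a: PaperStub, b: PaperStub, threshold: float = 0.9) -> bool:
    # 1. DOI match (most reliable): compare case-insensitively.
    if a.doi and b.doi and normalize(a.doi) == normalize(b.doi):
        return True
    # 2. Title similarity at or above the threshold (0.9 mirrors ">= 90% match").
    ratio = SequenceMatcher(None, normalize(a.title), normalize(b.title)).ratio()
    if ratio >= threshold:
        return True
    # 3. Same publication year and at least two shared author names (assumed rule).
    shared = {normalize(x) for x in a.authors} & {normalize(x) for x in b.authors}
    return a.year is not None and a.year == b.year and len(shared) >= 2
```

Author and year agreement alone is a weak signal, since the same group can publish several papers in one year, so a real implementation may want to combine it with a looser title check.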

Features:
- deduplicate_papers: Remove duplicates from paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing

Use cases:
- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
…pers

PMC provides free access to full-text biomedical and life sciences literature.
This integration enables searching and downloading complete articles.

Features:
- search_pmc: Search PMC using E-utilities API
- get_pmc_paper: Get paper by PMCID
- download_pmc: Download PDF of PMC articles
- read_pmc_paper: Extract text from PMC PDFs
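For orientation, a PMC search through E-utilities boils down to an ESearch request against the `pmc` database. The sketch below only builds the request URL and is not the PR's pmc.py: `build_pmc_search_url` is a hypothetical helper, and the 1800/3000 fallback bounds for an open-ended year range are an assumption (ESearch expects `mindate` and `maxdate` together).

```python
from typing import Optional
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(query: str, max_results: int = 10,
                         year_from: Optional[int] = None,
                         year_to: Optional[int] = None) -> str:
    """Build an ESearch URL for the PMC database (hypothetical helper)."""
    params = {"db": "pmc", "term": query, "retmax": max_results, "retmode": "json"}
    if year_from is not None or year_to is not None:
        # mindate and maxdate must be sent together; datetype=pdat filters
        # on publication date. The open-ended bounds are assumed placeholders.
        params["datetype"] = "pdat"
        params["mindate"] = str(year_from if year_from is not None else 1800)
        params["maxdate"] = str(year_to if year_to is not None else 3000)
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Calling `build_pmc_search_url("crispr", 5, 2020, 2022)` yields a URL that can be passed to any HTTP client; the returned ID list would then go to EFetch to retrieve the records.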

PMC differs from PubMed, which provides abstracts only: PMC hosts complete full-text articles
from biomedical and life sciences journals at NIH/NLM.

HAL is a multi-disciplinary open archive for scientific documents
from French institutions including theses, preprints, and articles.

Features:
- Search by query, author, domain
- Filter by document type (thesis, preprint, article)
- Browse by collection (CNRS, INRIA, universities)
- Get author profiles and download files

Copilot AI left a comment

Pull request overview

This PR claims to add HAL (French open archive) integration but actually adds multiple academic platform integrations including OpenAlex, PubMed Central (PMC), Sci-Hub, and deduplication utilities. The HAL searcher implementation is created but never integrated into the server, making the PR title and description misleading.

Changes:

  • Added OpenAlexSearcher with comprehensive search, citation, and related paper functionality
  • Added PMCSearcher for PubMed Central biomedical full-text papers
  • Added SciHubFetcher for downloading PDFs from Sci-Hub
  • Added deduplication utilities to remove duplicate papers from multiple sources
  • Added HALSearcher (but not integrated into server.py)
  • Enhanced SemanticSearcher with citation, reference, related paper, and author search methods

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 28 comments.

| File | Description |
| --- | --- |
| paper_search_mcp/server.py | Added tools for OpenAlex, PMC, Sci-Hub, and deduplication; enhanced Semantic Scholar tools; contains multiple bugs with unused httpx clients |
| paper_search_mcp/academic_platforms/hal.py | Complete HAL searcher implementation (NOT integrated into server) with syntax error in collections list |
| paper_search_mcp/academic_platforms/openalex.py | OpenAlex API integration with bugs in year range filtering and related papers query |
| paper_search_mcp/academic_platforms/pmc.py | PubMed Central integration with bugs in year range filtering and XML parsing |
| paper_search_mcp/academic_platforms/semantic.py | Added citation, reference, related paper, and author search methods |
| paper_search_mcp/deduplication.py | Paper deduplication utilities with potential datetime comparison bug |

Comment on lines +425 to +427
    async with httpx.AsyncClient() as client:
        papers = semantic_searcher.search_by_author(author_name, max_results)
        return [paper.to_dict() for paper in papers] if papers else []

Copilot AI Jan 23, 2026

An async httpx.AsyncClient is created but never used. The underlying semantic_searcher methods are synchronous and don't need this client. Remove this unused context manager.
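A rough sketch of the tool without the unused client is shown below. `StubSearcher` and `search_by_author_tool` are hypothetical stand-ins, not the PR's code. Wrapping the call in `asyncio.to_thread` (Python 3.9+) goes beyond what the comment asks for: it is one optional way to keep a blocking, synchronous searcher from stalling the event loop.

```python
import asyncio
from typing import Dict, List


class StubSearcher:
    """Hypothetical synchronous searcher standing in for semantic_searcher."""

    def search_by_author(self, author_name: str, max_results: int) -> List[Dict[str, str]]:
        return [{"title": f"Paper {i}", "author": author_name} for i in range(max_results)]


searcher = StubSearcher()


async def search_by_author_tool(author_name: str, max_results: int = 10) -> List[Dict[str, str]]:
    # No httpx.AsyncClient here: the searcher performs its own blocking I/O.
    # asyncio.to_thread runs that blocking call in a worker thread.
    papers = await asyncio.to_thread(searcher.search_by_author, author_name, max_results)
    return papers or []
```

If the extra thread hop is not wanted, calling `searcher.search_by_author(...)` directly inside the coroutine matches the suggested removal exactly.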

Copilot uses AI. Check for mistakes.
Comment on lines +580 to +582
    async with httpx.AsyncClient() as client:
        paper = openalex_searcher.get_paper_by_id(paper_id)
        return paper.to_dict() if paper else {}

Copilot AI Jan 23, 2026

An async httpx.AsyncClient is created but never used in this function. The underlying openalex_searcher.get_paper_by_id method is synchronous. Remove this unused context manager.

Comment on lines +933 to +935
    async with httpx.AsyncClient() as client:
        paper = pmc_searcher.get_paper_by_pmcid(paper_id)
        return paper.to_dict() if paper else {}

Copilot AI Jan 23, 2026

An async httpx.AsyncClient is created but never used. The underlying pmc_searcher.get_paper_by_pmcid method is synchronous. Remove this unused context manager.

Suggested change
-    async with httpx.AsyncClient() as client:
-        paper = pmc_searcher.get_paper_by_pmcid(paper_id)
-        return paper.to_dict() if paper else {}
+    paper = pmc_searcher.get_paper_by_pmcid(paper_id)
+    return paper.to_dict() if paper else {}

Comment on lines +96 to +110
        if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
            return True

    # Check author + year match
    if paper1.authors and paper2.authors:
        authors1 = [a.lower().strip() for a in paper1.authors]
        authors2 = [a.lower().strip() for a in paper2.authors]
        # Check if at least 2 authors match
        matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
        if matching_authors >= 2:
            # Check publication year
            year1 = paper1.published_date.year if paper1.published_date else None
            year2 = paper2.published_date.year if paper2.published_date else None
            if year1 and year2 and year1 == year2:
                return True

Copilot AI Jan 23, 2026

The author matching logic has O(n*m) complexity where it checks all author pairs. For papers with many authors, this could be inefficient. Consider converting authors to a set and using set operations for faster matching, or at least short-circuit after finding the first match.

Suggested change
-        if any(a1 in a2 or a2 in a1 for a1 in authors1 for a2 in authors2):
-            return True
-    # Check author + year match
-    if paper1.authors and paper2.authors:
-        authors1 = [a.lower().strip() for a in paper1.authors]
-        authors2 = [a.lower().strip() for a in paper2.authors]
-        # Check if at least 2 authors match
-        matching_authors = sum(1 for a1 in authors1 for a2 in authors2 if a1 in a2 or a2 in a1)
-        if matching_authors >= 2:
-            # Check publication year
-            year1 = paper1.published_date.year if paper1.published_date else None
-            year2 = paper2.published_date.year if paper2.published_date else None
-            if year1 and year2 and year1 == year2:
-                return True
+        for a1 in authors1:
+            for a2 in authors2:
+                if a1 in a2 or a2 in a1:
+                    return True
+    # Check author + year match
+    if paper1.authors and paper2.authors:
+        # Check publication year first; if years don't match, author matches can't make them the same paper
+        year1 = paper1.published_date.year if paper1.published_date else None
+        year2 = paper2.published_date.year if paper2.published_date else None
+        if not (year1 and year2 and year1 == year2):
+            return False
+        authors1 = [a.lower().strip() for a in paper1.authors]
+        authors2 = [a.lower().strip() for a in paper2.authors]
+        # Check if at least 2 authors match, short-circuiting once the threshold is reached
+        matching_authors = 0
+        for a1 in authors1:
+            for a2 in authors2:
+                if a1 in a2 or a2 in a1:
+                    matching_authors += 1
+                    if matching_authors >= 2:
+                        return True
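The set-based alternative mentioned in the comment could look roughly like this. `normalized_authors` and `share_at_least` are hypothetical helpers, not part of the PR. The sketch deliberately swaps substring matching for exact matching on normalized names, which changes behavior: "A. Lovelace" no longer matches "Ada Lovelace".

```python
from typing import Iterable, Set


def normalized_authors(authors: Iterable[str]) -> Set[str]:
    """Lowercase and strip each name, dropping empty entries."""
    return {a.lower().strip() for a in authors if a and a.strip()}


def share_at_least(authors1: Iterable[str], authors2: Iterable[str], minimum: int = 2) -> bool:
    """True if the two author lists share at least `minimum` exact normalized names."""
    # Set intersection is roughly O(n + m) instead of comparing every pair.
    common = normalized_authors(authors1) & normalized_authors(authors2)
    return len(common) >= minimum
```

Whether the stricter matching is acceptable depends on how consistently the upstream sources format author names; the nested-loop version remains the safer choice if abbreviated names are common.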

response = self.session.get(fetch_url, params=params, timeout=30)
response.raise_for_status()

return response.text.decode('utf-8')

Copilot AI Jan 23, 2026

response.text is already a string, so calling .decode('utf-8') on it raises AttributeError. decode exists only on bytes. Either use response.text directly or call response.content.decode('utf-8').

Suggested change
-    return response.text.decode('utf-8')
+    return response.text
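The str/bytes distinction behind this fix can be shown without a network call. In `requests`, `response.text` is a `str` and `response.content` is `bytes`; the helper below is a hypothetical illustration that accepts either, not something the PR contains.

```python
from typing import Union


def to_text(payload: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Return text whether the payload is already str or still raw bytes."""
    if isinstance(payload, bytes):
        # Only bytes objects have .decode(); this mirrors response.content.
        return payload.decode(encoding)
    # str has no .decode() in Python 3; this mirrors response.text.
    return payload


raw = "café".encode("utf-8")   # what response.content would hold
text = to_text(raw)            # decoded once, safely
```

In the fetcher itself the one-line suggestion is enough; a helper like this only pays off if the same code path can receive both types.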

        # Sort by position and join
        sorted_indices = sorted(index_to_word.keys())
        abstract_text = " ".join([index_to_word[i] for i in sorted_indices])
    except:

Copilot AI Jan 23, 2026

This bare except clause catches BaseException, which includes KeyboardInterrupt and SystemExit. Catch Exception, or a narrower type, instead.

Suggested change
-    except:
+    except Exception:
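For context, the code under review rebuilds an abstract from OpenAlex's `abstract_inverted_index`, which maps each word to the positions where it occurs. A minimal sketch with narrowed exception handling might look like this; `rebuild_abstract` is a hypothetical name, and the choice of exception types is an assumption about what a malformed payload can raise.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Turn a word -> positions mapping back into running text."""
    if not inverted_index:
        return ""
    index_to_word: Dict[int, str] = {}
    try:
        for word, positions in inverted_index.items():
            for position in positions:
                index_to_word[int(position)] = word
    except (AttributeError, TypeError, ValueError):
        # Malformed payload (not a dict, positions not iterable or not numeric):
        # treat it as a missing abstract rather than failing the whole record.
        return ""
    # Sort by position and join
    return " ".join(index_to_word[i] for i in sorted(index_to_word))
```

Listing the expected exception types keeps KeyboardInterrupt and genuine programming errors visible, which is the point of the review comment.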

        month_int = int(month) if month and month.isdigit() else 1
        day_int = int(day) if day and day.isdigit() else 1
        published_date = datetime(int(year), month_int, day_int)
    except:

Copilot AI Jan 23, 2026

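This bare except clause catches BaseException, which includes KeyboardInterrupt and SystemExit. Catch only the exceptions the date conversion can raise.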

Suggested change
-    except:
+    except (ValueError, TypeError):
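As a self-contained illustration of why `(ValueError, TypeError)` covers this case, the date assembly can be pulled into a small function. `parse_pub_date` is a hypothetical helper; the fallback of month and day to 1 follows the snippet under review.

```python
from datetime import datetime
from typing import Optional


def parse_pub_date(year, month=None, day=None) -> Optional[datetime]:
    """Build a datetime from loosely typed year/month/day parts, or return None."""
    try:
        # Non-numeric month or day values (for example "Mar") fall back to 1.
        month_int = int(month) if month and str(month).isdigit() else 1
        day_int = int(day) if day and str(day).isdigit() else 1
        return datetime(int(year), month_int, day_int)
    except (ValueError, TypeError):
        # ValueError: non-numeric year or an impossible date such as Feb 30.
        # TypeError: year is None.
        return None
```

Returning None lets the caller decide whether a paper without a usable date should be kept or skipped.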

    if d.get("published_date"):
        try:
            published_date = datetime.fromisoformat(d["published_date"])
        except:

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.
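One way to address this, sketched under the assumption that an unparseable date should simply leave the field unset, is a small helper that names the expected exception and documents why it is ignored. `parse_iso_datetime` is a hypothetical name, not part of the PR.

```python
from datetime import datetime
from typing import Any, Optional


def parse_iso_datetime(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None when the value is unusable."""
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so map it to an explicit offset.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        # Not a format fromisoformat understands; leave the date unset on purpose.
        return None
```

The same helper could serve both the published_date and updated_date fields, so the explanatory comment only has to live in one place.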

    if d.get("updated_date"):
        try:
            updated_date = datetime.fromisoformat(d["updated_date"])
        except:

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

    if pub_date:
        try:
            published_date = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
        except:

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-        except:
+        except (ValueError, TypeError):
+            # Ignore invalid or unexpected publication date formats from the API

2 participants