feat(deduplication): add cross-source paper deduplication #33

Open
hongkongkiwi wants to merge 4 commits into openags:main from hongkongkiwi:feature/deduplication

@hongkongkiwi

Add cross-source paper deduplication. The same paper often appears in multiple sources. This PR adds the deduplicate_papers, merge_papers, and find_duplicate_groups tools.

…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works.
This integration provides:

- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
…holar

This enhancement adds comprehensive citation and reference functionality:

Semantic Scholar enhancements:
- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:
- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis

Sci-Hub provides access to millions of research papers behind paywalls.
This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:
- download_scihub: Download PDFs using DOI, PMID, or URL

Note:
- Sci-Hub operates in a legal gray area
- Only use for legitimate research purposes
- Ensure compliance with local laws and institution policies

The same paper often appears in multiple sources (arXiv, Semantic Scholar, etc.).
This feature adds intelligent deduplication based on:
- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching

Features:
- deduplicate_papers: Remove duplicates from paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing

Use cases:
- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
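
The matching order described above could be sketched roughly as follows. This is an illustration only, assuming papers are plain dicts with title and doi keys. The helper names (normalize_doi, normalize_title, is_duplicate, deduplicate) and the sample DOIs are hypothetical and are not taken from deduplication.py. The author + year check is left out because the description does not say how it combines with the other two.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def normalize_doi(doi: Optional[str]) -> str:
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    if not doi:
        return ""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def normalize_title(title: Optional[str]) -> str:
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()


def is_duplicate(a: Dict, b: Dict, threshold: float = 0.90) -> bool:
    """Check DOI first (most reliable), then fall back to title similarity."""
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_b:
        # Assumption: when both records carry a DOI, the DOI decides.
        # (A preprint and its published version may have different DOIs.)
        return doi_a == doi_b
    title_a = normalize_title(a.get("title"))
    title_b = normalize_title(b.get("title"))
    if not title_a or not title_b:
        return False
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold


def deduplicate(papers: List[Dict], threshold: float = 0.90) -> List[Dict]:
    """Keep the first occurrence of each paper; drop later duplicates."""
    unique: List[Dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept, threshold) for kept in unique):
            unique.append(paper)
    return unique


# Made-up records; the DOIs are placeholders, not real identifiers.
sample = [
    {"title": "Attention Is All You Need", "doi": "10.1000/xyz123", "source": "arxiv"},
    {"title": "Attention is all you need.", "doi": "https://doi.org/10.1000/XYZ123", "source": "semantic"},
    {"title": "Attention Is All You Need", "doi": None, "source": "openalex"},
    {"title": "A Completely Different Paper", "doi": "10.1000/abc", "source": "arxiv"},
]
print([p["source"] for p in deduplicate(sample)])
```

The pairwise comparison is quadratic in the number of papers, which is probably fine for aggregated search results but would need indexing (for example by normalized DOI) for larger lists.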

Copilot AI left a comment

Pull request overview

This PR adds cross-source paper deduplication functionality to handle papers appearing across multiple academic sources (arXiv, Semantic Scholar, OpenAlex, etc.). It introduces a comprehensive deduplication module with DOI matching, title similarity analysis, and author-year matching, plus integration with two new academic platforms: OpenAlex and Sci-Hub.

Changes:

  • Added deduplication module with three new tools: deduplicate_papers, merge_papers, and find_duplicate_groups
  • Integrated OpenAlex API with comprehensive search, citation, and author lookup functionality
  • Added Sci-Hub downloader for accessing paywalled papers
  • Extended Semantic Scholar integration with citation, reference, related paper, and author search methods

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 35 comments.

| File | Description |
| --- | --- |
| paper_search_mcp/deduplication.py | New module implementing paper deduplication logic with DOI, title similarity, and author-year matching |
| paper_search_mcp/server.py | Added new MCP tools for deduplication, OpenAlex, Sci-Hub, and extended Semantic Scholar functionality |
| paper_search_mcp/academic_platforms/openalex.py | New OpenAlex API integration for searching, retrieving papers, citations, and author works |
| paper_search_mcp/academic_platforms/semantic.py | Extended with methods for citations, references, related papers, and author search |

Comment on lines +357 to +358
except:
    pass

Copilot AI Jan 23, 2026

Bare except clause is too broad and will catch all exceptions including KeyboardInterrupt and SystemExit. Consider catching specific exception types like ValueError or TypeError instead.
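
The difference can be seen in a small, self-contained sketch. The function names here are hypothetical and exist only for the demonstration; they do not appear in the PR.

```python
def run_with_bare_except(action):
    """Bare except: intercepts everything, including SystemExit."""
    try:
        action()
    except:  # noqa: E722 - shown only to illustrate the problem
        return "swallowed"
    return "ok"


def run_with_specific_except(action):
    """Only the errors we expect from bad input are handled."""
    try:
        action()
    except (ValueError, TypeError):
        return "swallowed"
    return "ok"


def request_exit():
    raise SystemExit(1)


def bad_value():
    raise ValueError("malformed input")


# The bare except hides the exit request.
print(run_with_bare_except(request_exit))

# The specific handler still deals with ordinary bad input...
print(run_with_specific_except(bad_value))

# ...but lets the exit request propagate to the caller.
try:
    run_with_specific_except(request_exit)
except SystemExit:
    print("SystemExit propagated")
```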

Copilot uses AI. Check for mistakes.
Comment on lines +407 to +408
Useful for analyzing what duplicates exist before deciding how to handle them.


Copilot AI Jan 23, 2026

Bare except clause is too broad. Catch specific exception types, or Exception at the broadest, so that KeyboardInterrupt and SystemExit are not swallowed. This matters particularly in a loop, where silent failures could cause papers to be skipped incorrectly.

Comment on lines +363 to +365
async with httpx.AsyncClient() as client:
    papers = semantic_searcher.get_citations(paper_id, max_results)
    return [paper.to_dict() for paper in papers] if papers else []

Copilot AI Jan 23, 2026

Unused client context: The httpx.AsyncClient is created but never used in these functions. The underlying searcher methods (get_citations, get_references, get_related_papers) don't accept or use an HTTP client parameter. Consider removing the unused async with statement or investigate if the client should be passed to the searcher methods.
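
One possible shape for the fix is sketched below. It assumes the searcher method is synchronous (it is called without await in the diff) and uses a stub in place of the real semantic_searcher, whose implementation is not shown here. StubPaper and StubSearcher are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StubPaper:
    """Hypothetical stand-in for the project's paper class."""
    paper_id: str

    def to_dict(self) -> Dict[str, str]:
        return {"paper_id": self.paper_id}


class StubSearcher:
    """Hypothetical stand-in for semantic_searcher; performs no real HTTP."""

    def get_citations(self, paper_id: str, max_results: int = 20) -> List[StubPaper]:
        return [StubPaper(f"{paper_id}-cited-by-{i}") for i in range(max_results)]


semantic_searcher = StubSearcher()


async def get_semantic_citations(paper_id: str, max_results: int = 20) -> List[Dict]:
    # No httpx.AsyncClient: the searcher never receives a client, so creating
    # one only adds setup and teardown cost.
    # asyncio.to_thread (Python 3.9+) keeps the blocking call off the event loop.
    papers = await asyncio.to_thread(
        semantic_searcher.get_citations, paper_id, max_results
    )
    return [paper.to_dict() for paper in papers] if papers else []
```

If the searcher methods are later made asynchronous, the to_thread wrapper can be replaced by a direct await.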

Comment on lines +658 to +660
async with httpx.AsyncClient() as client:
    papers = openalex_searcher.search_by_author(author_name, max_results, **kwargs)
    return [paper.to_dict() for paper in papers] if papers else []

Copilot AI Jan 23, 2026

Unused client context: The httpx.AsyncClient is created but never used in this function. The underlying searcher method (search_by_author) doesn't accept or use an HTTP client parameter. Consider removing the unused async with statement.

for d in papers:
    try:
        paper_objs.append(dict_to_paper(d))
    except Exception:

Copilot AI Jan 23, 2026

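Catching Exception is broader than this conversion needs. It does not intercept KeyboardInterrupt or SystemExit, but it can still hide unrelated bugs. Consider catching only the exceptions the conversion is expected to raise, such as KeyError, TypeError, and ValueError.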

Suggested change
- except Exception:
+ except (KeyError, TypeError, ValueError):
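
A sketch of how the narrowed handler might look in the conversion loop, with skipped records reported rather than dropped silently. The convert_all helper and the to_record converter are hypothetical; in the PR the conversion is done by dict_to_paper, whose body is not shown here.

```python
import logging
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def convert_all(
    records: List[Dict], convert: Callable[[Dict], T]
) -> Tuple[List[T], List[int]]:
    """Convert each record, returning results plus the indexes that failed."""
    converted: List[T] = []
    skipped: List[int] = []
    for index, record in enumerate(records):
        try:
            converted.append(convert(record))
        except (KeyError, TypeError, ValueError) as exc:
            # Expected data problems only; anything else propagates.
            logger.warning("Skipping record %d: %r", index, exc)
            skipped.append(index)
    return converted, skipped


def to_record(d: Dict) -> Dict:
    """Hypothetical converter standing in for dict_to_paper."""
    return {"title": d["title"], "year": int(d["year"])}


rows = [
    {"title": "A", "year": "2020"},
    {"year": "2021"},                 # missing title -> KeyError
    {"title": "C", "year": "n/a"},    # bad year -> ValueError
]
papers, skipped = convert_all(rows, to_record)
print(len(papers), skipped)
```

Returning the skipped indexes lets the caller decide whether a partial result is acceptable.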

- Author + year matching (tertiary method)
"""
from typing import List, Dict, Set, Tuple
from collections import defaultdict

Copilot AI Jan 23, 2026

Import of 'defaultdict' is not used.

Suggested change
- from collections import defaultdict

"""
from typing import List, Optional
from datetime import datetime
import time

Copilot AI Jan 23, 2026

Import of 'time' is not used.

Suggested change
- import time

if d.get("published_date"):
    try:
        published_date = datetime.fromisoformat(d["published_date"])
    except:

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

if d.get("updated_date"):
    try:
        updated_date = datetime.fromisoformat(d["updated_date"])
    except:

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Comment on lines +369 to +370
except:
    pass

Copilot AI Jan 23, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except:
-     pass
+ except (ValueError, TypeError):
+     # Ignore malformed or unexpected publication dates; leave as None.
+     published_date = None
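
Since the same pattern appears for both published_date and updated_date, the suggested handling could be factored into one helper. The sketch below is an illustration under that assumption; parse_iso_date is a hypothetical name and is not part of the PR.

```python
from datetime import datetime
from typing import Any, Optional


def parse_iso_date(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None for missing or malformed input."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        # ValueError: a string that is not a valid ISO date.
        # TypeError: a non-string value such as an int.
        return None


record = {"published_date": "2024-01-15", "updated_date": "not-a-date"}
published_date = parse_iso_date(record.get("published_date"))
updated_date = parse_iso_date(record.get("updated_date"))
print(published_date, updated_date)
```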

2 participants