feat(deduplication): add cross-source paper deduplication #33
hongkongkiwi wants to merge 4 commits into openags:main
…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works. This integration provides:
- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF
…holar

This enhancement adds comprehensive citation and reference functionality.

Semantic Scholar enhancements:
- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:
- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis
Sci-Hub provides access to millions of research papers behind paywalls. This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:
- download_scihub: Download PDFs using DOI, PMID, or URL

Note:
- Sci-Hub operates in a legal gray area
- Only use for legitimate research purposes
- Ensure compliance with local laws and institution policies
The same paper often appears in multiple sources (arXiv, Semantic Scholar, etc.). This feature adds intelligent deduplication based on:
- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching

Features:
- deduplicate_papers: Remove duplicates from paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing

Use cases:
- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
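The three matching rules listed in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code in `deduplication.py`: the dict keys (`doi`, `title`, `authors`, `year`), the helper names, and the use of `difflib.SequenceMatcher` for title similarity are all assumptions. In particular, the author + year rule here also requires a loosely similar title (0.75), an assumed safeguard so that two different papers by the same author in the same year are not merged.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional

TITLE_THRESHOLD = 0.90        # the ">= 90% match" rule from the description
LOOSE_TITLE_THRESHOLD = 0.75  # assumed safeguard for the author + year rule


def normalize_doi(doi: Optional[str]) -> str:
    """Lower-case a DOI and strip a resolver URL or 'doi:' prefix."""
    doi = (doi or "").strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def normalize_title(title: Optional[str]) -> str:
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", (title or "").lower())
    return " ".join(cleaned.split())


def first_author_key(paper: Dict) -> str:
    """Last word of the first author's name, lower-cased ('' if unknown)."""
    authors = paper.get("authors") or []
    if not authors or not authors[0].strip():
        return ""
    return authors[0].split()[-1].lower()


def is_duplicate(a: Dict, b: Dict) -> bool:
    # 1. DOI match: the most reliable signal when both records carry a DOI.
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_a == doi_b:
        return True

    # 2. Title similarity on normalized titles.
    title_a, title_b = normalize_title(a.get("title")), normalize_title(b.get("title"))
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a, title_b).ratio()
    if ratio >= TITLE_THRESHOLD:
        return True

    # 3. Same first author and year, plus a loosely similar title.
    author_a, author_b = first_author_key(a), first_author_key(b)
    same_author = bool(author_a) and author_a == author_b
    same_year = a.get("year") is not None and a.get("year") == b.get("year")
    return same_author and same_year and ratio >= LOOSE_TITLE_THRESHOLD


def deduplicate(papers: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each paper (pairwise, O(n^2) comparisons)."""
    unique: List[Dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique


papers = [
    {"title": "Attention Is All You Need", "doi": "10.1000/abc",
     "authors": ["Ashish Vaswani"], "year": 2017},
    {"title": "Attention is all you need.", "doi": "https://doi.org/10.1000/ABC",
     "authors": ["A. Vaswani"], "year": 2017},
    {"title": "Attention Is All You Need", "doi": None,
     "authors": ["Ashish Vaswani"], "year": 2017},
    {"title": "Deep Residual Learning for Image Recognition", "doi": "10.1000/xyz",
     "authors": ["Kaiming He"], "year": 2016},
]
unique_papers = deduplicate(papers)
```

Two records with different DOIs fall through to the title check rather than being declared distinct, since a preprint and its published version can carry different DOIs. The pairwise loop is fine for a few hundred search results; larger inputs would likely want an index keyed on the normalized DOI and title.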
Pull request overview
This PR adds cross-source paper deduplication functionality to handle papers appearing across multiple academic sources (arXiv, Semantic Scholar, OpenAlex, etc.). It introduces a comprehensive deduplication module with DOI matching, title similarity analysis, and author-year matching, plus integration with two new academic platforms: OpenAlex and Sci-Hub.
Changes:
- Added deduplication module with three new tools: `deduplicate_papers`, `merge_papers`, and `find_duplicate_groups`
- Integrated OpenAlex API with comprehensive search, citation, and author lookup functionality
- Added Sci-Hub downloader for accessing paywalled papers
- Extended Semantic Scholar integration with citation, reference, related paper, and author search methods
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 35 comments.
| File | Description |
|---|---|
| paper_search_mcp/deduplication.py | New module implementing paper deduplication logic with DOI, title similarity, and author-year matching |
| paper_search_mcp/server.py | Added new MCP tools for deduplication, OpenAlex, Sci-Hub, and extended Semantic Scholar functionality |
| paper_search_mcp/academic_platforms/openalex.py | New OpenAlex API integration for searching, retrieving papers, citations, and author works |
| paper_search_mcp/academic_platforms/semantic.py | Extended with methods for citations, references, related papers, and author search |
    except:
        pass
Bare except clause is too broad and will catch all exceptions including KeyboardInterrupt and SystemExit. Consider catching specific exception types like ValueError or TypeError instead.
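The difference this comment describes can be shown in a few lines. The function names below are hypothetical and exist only for the demonstration; the behavior follows from Python's exception hierarchy, where `KeyboardInterrupt` and `SystemExit` derive from `BaseException` rather than `Exception`.

```python
def run_with_bare_except(action):
    """Bare except: catches everything, including BaseException subclasses."""
    try:
        action()
    except:  # noqa: E722 (deliberately bare, to show the problem)
        return "swallowed"
    return "ok"


def run_with_except_exception(action):
    """except Exception: lets KeyboardInterrupt and SystemExit propagate."""
    try:
        action()
    except Exception:
        return "swallowed"
    return "ok"


def interrupt():
    raise KeyboardInterrupt


def bad_value():
    raise ValueError("malformed input")


# The bare clause silently absorbs a Ctrl-C style interrupt.
bare_result = run_with_bare_except(interrupt)

# The narrower clause still handles ordinary errors...
value_result = run_with_except_exception(bad_value)

# ...but the interrupt escapes, as it should.
try:
    run_with_except_exception(interrupt)
    interrupt_propagated = False
except KeyboardInterrupt:
    interrupt_propagated = True
```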
    Useful for analyzing what duplicates exist before deciding how to handle them.
Bare except clause is too broad. Catch Exception, or preferably the specific exception types you expect, so that KeyboardInterrupt and SystemExit are not masked. This is particularly important in a loop, where silent failures could cause papers to be incorrectly skipped.
    async with httpx.AsyncClient() as client:
        papers = semantic_searcher.get_citations(paper_id, max_results)
        return [paper.to_dict() for paper in papers] if papers else []
Unused client context: The httpx.AsyncClient is created but never used in these functions. The underlying searcher methods (get_citations, get_references, get_related_papers) don't accept or use an HTTP client parameter. Consider removing the unused async with statement or investigate if the client should be passed to the searcher methods.
    async with httpx.AsyncClient() as client:
        papers = openalex_searcher.search_by_author(author_name, max_results, **kwargs)
        return [paper.to_dict() for paper in papers] if papers else []
Unused client context: The httpx.AsyncClient is created but never used in this function. The underlying searcher method (search_by_author) doesn't accept or use an HTTP client parameter. Consider removing the unused async with statement.
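One way to act on this comment is sketched below. `StubPaper` and `StubSearcher` are hypothetical stand-ins so the example runs on its own; they are not the PR's classes. The sketch also assumes the real searcher methods are synchronous and blocking, in which case running them through `asyncio.to_thread` (Python 3.9+) keeps the event loop free. If the searchers are cheap or already async, simply dropping the `async with` block is enough.

```python
import asyncio
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class StubPaper:
    """Hypothetical stand-in for the project's paper class."""
    title: str

    def to_dict(self) -> Dict:
        return asdict(self)


class StubSearcher:
    """Hypothetical stand-in for a synchronous searcher."""

    def search_by_author(self, author_name: str, max_results: int = 10) -> List[StubPaper]:
        return [StubPaper(title=f"Paper {i} by {author_name}") for i in range(max_results)]


searcher = StubSearcher()


async def search_by_author(author_name: str, max_results: int = 10) -> List[Dict]:
    # No httpx.AsyncClient is opened: the searcher never receives one.
    # The (assumed) blocking call runs in a worker thread instead.
    papers = await asyncio.to_thread(searcher.search_by_author, author_name, max_results)
    return [paper.to_dict() for paper in papers] if papers else []


results = asyncio.run(search_by_author("Ada Lovelace", 2))
```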
    for d in papers:
        try:
            paper_objs.append(dict_to_paper(d))
        except Exception:
`except Exception` is broader than this conversion needs. It does not catch KeyboardInterrupt or SystemExit, but it can still hide unrelated bugs. Consider catching only the exceptions expected from a malformed paper dict.
Suggested change:
    - except Exception:
    + except (KeyError, TypeError, ValueError):
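A minimal sketch of the suggested narrowing follows. The `dict_to_paper` here is a hypothetical stand-in (the real helper is not shown in this review), and recording the skipped indexes is an added assumption, included so that dropped entries are visible rather than silently lost.

```python
from typing import Dict, List, Tuple


def dict_to_paper(d: Dict) -> Dict:
    """Hypothetical stand-in: needs a 'title' and an integer-like 'year'."""
    return {"title": d["title"], "year": int(d["year"])}


def convert_all(papers: List[Dict]) -> Tuple[List[Dict], List[int]]:
    """Convert each dict, skipping malformed entries and recording their indexes."""
    converted: List[Dict] = []
    skipped: List[int] = []
    for index, d in enumerate(papers):
        try:
            converted.append(dict_to_paper(d))
        except (KeyError, TypeError, ValueError):
            # Missing key, wrong type, or unparseable value: skip this entry only.
            skipped.append(index)
    return converted, skipped


converted, skipped = convert_all([
    {"title": "A", "year": "2020"},  # valid
    {"year": 2021},                  # KeyError: no title
    {"title": "B", "year": None},    # TypeError: int(None)
    {"title": "C", "year": "n/a"},   # ValueError: int("n/a")
])
```

Any other exception type, such as an `AttributeError` from a genuine bug, would now surface instead of being treated as bad input.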
    - Author + year matching (tertiary method)
    """
    from typing import List, Dict, Set, Tuple
    from collections import defaultdict
Import of 'defaultdict' is not used.
Suggested change:
    - from collections import defaultdict
    """
    from typing import List, Optional
    from datetime import datetime
    import time
Import of 'time' is not used.
Suggested change:
    - import time
    if d.get("published_date"):
        try:
            published_date = datetime.fromisoformat(d["published_date"])
        except:
'except' clause does nothing but pass and there is no explanatory comment.
    if d.get("updated_date"):
        try:
            updated_date = datetime.fromisoformat(d["updated_date"])
        except:
'except' clause does nothing but pass and there is no explanatory comment.
    except:
        pass
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
    - except:
    -     pass
    + except (ValueError, TypeError):
    +     # Ignore malformed or unexpected publication dates; leave as None.
    +     published_date = None
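Because the same try/except pattern appears for both `published_date` and `updated_date`, the suggestion could be factored into one small helper. The name `parse_iso_date` is hypothetical and not part of the PR; the sketch only assumes that the dates arrive as ISO 8601 strings, as the `datetime.fromisoformat` calls in the diff imply.

```python
from datetime import datetime
from typing import Any, Optional


def parse_iso_date(value: Any) -> Optional[datetime]:
    """Return a datetime for an ISO 8601 string, or None if missing or malformed."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        # ValueError: a string that is not a valid ISO date.
        # TypeError: a non-string value such as an int.
        return None


d = {"published_date": "2021-03-04", "updated_date": "not-a-date"}
published_date = parse_iso_date(d.get("published_date"))
updated_date = parse_iso_date(d.get("updated_date"))
```

Note that before Python 3.11, `datetime.fromisoformat` rejects some common ISO 8601 variants, such as a trailing `Z`; on those versions such strings would come back as `None` here rather than raising.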
Add cross-source paper deduplication. The same paper often appears in multiple sources. Adds the deduplicate_papers, merge_papers, and find_duplicate_groups tools.