feat(deduplication): add cross-source paper deduplication by hongkongkiwi · Pull Request #33 · openags/paper-search-mcp

hongkongkiwi · 2026-01-22T08:16:38Z

Add cross-source paper deduplication. The same paper often appears in multiple sources. Adds the deduplicate_papers, merge_papers, and find_duplicate_groups tools.

…ic paper search

OpenAlex is a free and open catalog of the global research system with over 200M works. This integration provides:
- Full paper search with advanced filters (year, type, full-text availability)
- Citation and reference traversal (forward and backward citations)
- Author-based search
- Related papers discovery based on concepts and references
- DOI and OpenAlex ID lookup
- Comprehensive metadata including concepts, keywords, and open access info

Features:
- search_openalex: Main search function with filtering options
- get_openalex_paper: Get paper by OpenAlex ID
- get_openalex_paper_by_doi: Get paper by DOI
- get_openalex_citations: Get papers that cite this work
- get_openalex_references: Get papers referenced by this work
- search_openalex_by_author: Search papers by author name
- get_openalex_related: Find related papers
- download_openalex: Download PDF from open access sources
- read_openalex_paper: Extract text from PDF

…holar

This enhancement adds comprehensive citation and reference functionality.

Semantic Scholar enhancements:
- get_semantic_citations: Get papers that cite this work (forward citations)
- get_semantic_references: Get papers referenced by this work (backward citations)
- get_semantic_related: Get related papers based on concepts and citations
- search_semantic_by_author: Search papers by author name

Note: OpenAlex already has full citation/reference support from the previous feature.

These tools enable:
- Citation graph traversal (forward and backward)
- Related paper discovery
- Author-based paper search
- Comprehensive citation analysis

Sci-Hub provides access to millions of research papers behind paywalls. This integration exposes the existing Sci-Hub fetcher as an MCP tool.

Features:
- download_scihub: Download PDFs using DOI, PMID, or URL

Note:
- Sci-Hub operates in a legal gray area
- Only use for legitimate research purposes
- Ensure compliance with local laws and institution policies

Same papers often appear in multiple sources (arXiv, Semantic Scholar, etc.). This feature adds intelligent deduplication based on:
- DOI matching (most reliable)
- Title similarity (>= 90% match)
- Author + year matching

Features:
- deduplicate_papers: Remove duplicates from paper list
- merge_papers: Merge duplicates by combining metadata
- find_duplicate_groups: Analyze duplicates without removing

Use cases:
- Combine results from multiple search sources
- Remove duplicate papers from aggregated results
- Merge complementary metadata from different sources
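For readers following along, the three matching rules can be sketched roughly as below. This is not the code from deduplication.py: the function names, the dict-based paper shape (`doi`, `title`, `authors`, `year`), the use of `difflib.SequenceMatcher` for similarity, and the looser 0.75 title threshold paired with the author + year rule are all assumptions made for illustration. Only the 90% title threshold and the order of the three checks come from the description above.

```python
import re
from difflib import SequenceMatcher
from typing import Any, Dict, List


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def title_similarity(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    title_a = normalize_title(a.get("title") or "")
    title_b = normalize_title(b.get("title") or "")
    if not title_a or not title_b:
        return 0.0
    return SequenceMatcher(None, title_a, title_b).ratio()


def is_duplicate(
    a: Dict[str, Any],
    b: Dict[str, Any],
    threshold: float = 0.9,
    loose_threshold: float = 0.75,
) -> bool:
    """Apply the three checks in order of reliability."""
    # 1. DOI match (case-insensitive). Differing DOIs fall through to the
    #    other checks, since a preprint and its published version may differ.
    doi_a = (a.get("doi") or "").strip().lower()
    doi_b = (b.get("doi") or "").strip().lower()
    if doi_a and doi_a == doi_b:
        return True

    # 2. Title similarity at or above the 90% threshold.
    similarity = title_similarity(a, b)
    if similarity >= threshold:
        return True

    # 3. Author + year. Assumption: same first author and year, plus a
    #    looser title match so unrelated papers by one author are kept apart.
    authors_a = a.get("authors") or []
    authors_b = b.get("authors") or []
    same_first_author = (
        bool(authors_a and authors_b)
        and authors_a[0].lower() == authors_b[0].lower()
    )
    same_year = a.get("year") is not None and a.get("year") == b.get("year")
    return same_first_author and same_year and similarity >= loose_threshold


def deduplicate(papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each paper; drop later duplicates."""
    unique: List[Dict[str, Any]] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

The pairwise comparison is quadratic in the number of papers, which is probably acceptable for aggregated search results of a few hundred entries but would need an index (for example, a DOI lookup table) for larger lists.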

Copilot

Pull request overview

This PR adds cross-source paper deduplication functionality to handle papers appearing across multiple academic sources (arXiv, Semantic Scholar, OpenAlex, etc.). It introduces a comprehensive deduplication module with DOI matching, title similarity analysis, and author-year matching, plus integration with two new academic platforms: OpenAlex and Sci-Hub.

Changes:

Added deduplication module with three new tools: deduplicate_papers, merge_papers, and find_duplicate_groups
Integrated OpenAlex API with comprehensive search, citation, and author lookup functionality
Added Sci-Hub downloader for accessing paywalled papers
Extended Semantic Scholar integration with citation, reference, related paper, and author search methods

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 35 comments.

File	Description
paper_search_mcp/deduplication.py	New module implementing paper deduplication logic with DOI, title similarity, and author-year matching
paper_search_mcp/server.py	Added new MCP tools for deduplication, OpenAlex, Sci-Hub, and extended Semantic Scholar functionality
paper_search_mcp/academic_platforms/openalex.py	New OpenAlex API integration for searching, retrieving papers, citations, and author works
paper_search_mcp/academic_platforms/semantic.py	Extended with methods for citations, references, related papers, and author search


Copilot · 2026-01-23T10:07:29Z

paper_search_mcp/deduplication.py

+        except:
+            pass


Bare except clause is too broad and will catch all exceptions including KeyboardInterrupt and SystemExit. Consider catching specific exception types like ValueError or TypeError instead.
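To make the concern concrete, here is a small illustration of the difference. The helper names are hypothetical and not part of the PR; the point is only that `KeyboardInterrupt` and `SystemExit` derive from `BaseException` rather than `Exception`, so a bare `except:` swallows them while a tuple of specific types does not.

```python
from typing import Callable


def run_with_bare_except(action: Callable[[], None]) -> str:
    """Bare except: catches everything, including KeyboardInterrupt."""
    try:
        action()
    except:  # noqa: E722 - shown only to demonstrate the problem
        return "swallowed"
    return "ok"


def run_with_specific_except(action: Callable[[], None]) -> str:
    """Specific types: only the expected failures are handled."""
    try:
        action()
    except (ValueError, TypeError):
        return "swallowed"
    return "ok"


def bad_value() -> None:
    raise ValueError("malformed input")


def interrupt() -> None:
    raise KeyboardInterrupt


# KeyboardInterrupt and SystemExit sit outside the Exception hierarchy.
assert not issubclass(KeyboardInterrupt, Exception)
assert not issubclass(SystemExit, Exception)
assert issubclass(KeyboardInterrupt, BaseException)
```

With these helpers, `run_with_bare_except(interrupt)` returns normally and hides the interrupt, whereas `run_with_specific_except(interrupt)` lets it propagate to the caller.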

Copilot · 2026-01-23T10:07:30Z

paper_search_mcp/deduplication.py

+    Useful for analyzing what duplicates exist before deciding how to handle them.
+


Bare except clause is too broad. Consider catching specific exceptions like Exception to avoid masking critical errors like KeyboardInterrupt and SystemExit. This is particularly important in a loop where silent failures could cause papers to be incorrectly skipped.

Copilot · 2026-01-23T10:07:30Z

paper_search_mcp/server.py

+    async with httpx.AsyncClient() as client:
+        papers = semantic_searcher.get_citations(paper_id, max_results)
+        return [paper.to_dict() for paper in papers] if papers else []


Unused client context: The httpx.AsyncClient is created but never used in these functions. The underlying searcher methods (get_citations, get_references, get_related_papers) don't accept or use an HTTP client parameter. Consider removing the unused async with statement or investigate if the client should be passed to the searcher methods.
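One possible shape for the simplified tool is sketched below. `StubPaper` and `StubSearcher` are hypothetical stand-ins so the snippet runs on its own; the real searcher in semantic.py may behave differently. Wrapping the call in `asyncio.to_thread` goes beyond what the comment asks for and assumes the searcher method is synchronous and blocking.

```python
import asyncio
from typing import Dict, List


class StubPaper:
    """Hypothetical stand-in for the project's paper objects."""

    def __init__(self, paper_id: str) -> None:
        self.paper_id = paper_id

    def to_dict(self) -> Dict[str, str]:
        return {"paper_id": self.paper_id}


class StubSearcher:
    """Hypothetical stand-in for semantic_searcher; takes no HTTP client."""

    def get_citations(self, paper_id: str, max_results: int) -> List[StubPaper]:
        return [StubPaper(f"{paper_id}-cite-{i}") for i in range(max_results)]


semantic_searcher = StubSearcher()


async def get_semantic_citations(paper_id: str, max_results: int = 10) -> List[Dict]:
    # No httpx.AsyncClient: the searcher never receives one, so creating it
    # only adds setup and teardown cost.
    # Assumption: get_citations blocks, so run it in a worker thread to keep
    # the event loop responsive.
    papers = await asyncio.to_thread(
        semantic_searcher.get_citations, paper_id, max_results
    )
    return [paper.to_dict() for paper in papers] if papers else []
```

If the searcher methods are later changed to accept a client, the `async with` block could return and the client could be passed through explicitly.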

Copilot · 2026-01-23T10:07:30Z

paper_search_mcp/server.py

+    async with httpx.AsyncClient() as client:
+        papers = openalex_searcher.search_by_author(author_name, max_results, **kwargs)
+        return [paper.to_dict() for paper in papers] if papers else []


Unused client context: The httpx.AsyncClient is created but never used in this function. The underlying searcher method (search_by_author) doesn't accept or use an HTTP client parameter. Consider removing the unused async with statement.

Copilot · 2026-01-23T10:07:30Z

paper_search_mcp/server.py

+    for d in papers:
+        try:
+            paper_objs.append(dict_to_paper(d))
+        except Exception:


Catching Exception here is broader than needed. It already lets KeyboardInterrupt and SystemExit propagate, but it can still hide unrelated bugs raised inside dict_to_paper. Consider catching only the exceptions expected from malformed input.

Suggested change

-        except Exception:
+        except (KeyError, TypeError, ValueError):

Copilot · 2026-01-23T10:07:38Z

paper_search_mcp/deduplication.py

+- Author + year matching (tertiary method)
+"""
+from typing import List, Dict, Set, Tuple
+from collections import defaultdict


Import of 'defaultdict' is not used.

Suggested change

-from collections import defaultdict

Copilot · 2026-01-23T10:07:38Z

paper_search_mcp/academic_platforms/openalex.py

+"""
+from typing import List, Optional
+from datetime import datetime
+import time


Import of 'time' is not used.

Suggested change

-import time

Copilot · 2026-01-23T10:07:38Z

paper_search_mcp/deduplication.py

+    if d.get("published_date"):
+        try:
+            published_date = datetime.fromisoformat(d["published_date"])
+        except:


'except' clause does nothing but pass and there is no explanatory comment.

Copilot · 2026-01-23T10:07:38Z

paper_search_mcp/deduplication.py

+    if d.get("updated_date"):
+        try:
+            updated_date = datetime.fromisoformat(d["updated_date"])
+        except:


'except' clause does nothing but pass and there is no explanatory comment.
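Since the same pattern appears for both published_date and updated_date, one option is a small shared helper that names the expected exceptions and documents why they are ignored. `parse_iso_date` is a hypothetical name, not something in the PR, and treating a malformed date as missing is an assumption about the intended behavior.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_iso_date(d: Dict[str, Any], key: str) -> Optional[datetime]:
    """Return d[key] parsed as an ISO 8601 datetime, or None if absent or invalid."""
    raw = d.get(key)
    if not raw:
        return None
    try:
        return datetime.fromisoformat(raw)
    except (ValueError, TypeError):
        # ValueError: string is not valid ISO 8601.
        # TypeError: value is not a string at all.
        # Either way, treat the date as missing instead of failing the record.
        return None
```

The call sites would then reduce to `published_date = parse_iso_date(d, "published_date")` and `updated_date = parse_iso_date(d, "updated_date")`.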

Copilot · 2026-01-23T10:07:39Z

paper_search_mcp/academic_platforms/openalex.py

+                except:
+                    pass


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-                except:
-                    pass
+                except (ValueError, TypeError):
+                    # Ignore malformed or unexpected publication dates; leave as None.
+                    published_date = None

hongkongkiwi added 4 commits January 22, 2026 15:14

universea requested a review from Copilot January 23, 2026 09:58

Copilot started reviewing on behalf of universea January 23, 2026 09:58


hongkongkiwi wants to merge 4 commits into openags:main from hongkongkiwi:feature/deduplication


2 participants
