This project implements a comprehensive Information Retrieval (IR) system designed to retrieve research papers based on metadata, abstracts, and full-text data. The system includes robust indexing, query processing, ranking, and relevance evaluation features, integrated with advanced models like LLAMA for query augmentation and summarization. A user-friendly Streamlit app (`PapeRet.py`) enables interactive search and retrieval.

## Features
- Streamlit Interface:
  - Search for research papers interactively.
  - Optional LLAMA-enhanced query processing.
  - AI-generated summaries of retrieved documents.
- Indexing:
  - `BasicInvertedIndex` for efficient term-document mappings (sketched below).
  - Dynamic computation of document statistics such as term frequencies and document frequencies.
- Query Processing:
  - Tokenization and stopword removal.
  - Augmented query processing with title index integration.
  - Enhanced query understanding with author and year filters.
- Ranking and Scoring:
  - BM25 and TF-IDF relevance scoring algorithms (see the BM25 sketch below).
  - LLAMA-powered query-to-keyword extraction for improved ranking.
  - Dynamic re-ranking based on query-matched authors and publication years (see the re-ranking sketch below).
- Relevance Evaluation:
  - Metrics such as MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain), sketched below.
  - Pre-curated relevance scores for benchmarking rankers.
- Document Preprocessing:
  - Regex tokenization.
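
As a rough illustration of the indexing and preprocessing features above, here is a minimal sketch of a regex tokenizer plus a term-to-document postings structure. It is not the project's actual API (that lives in `document_preprocessor.py` and `indexing.py`); all names here are illustrative.

```python
import re
from collections import Counter, defaultdict

TOKEN_PATTERN = re.compile(r"\w+")

def tokenize(text, stopwords):
    """Regex tokenization followed by stopword removal."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stopwords]

class BasicInvertedIndexSketch:
    """Maps each term to {docid: term frequency} and tracks document stats."""
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {docid: tf}
        self.doc_lengths = {}               # docid -> number of tokens

    def add_document(self, docid, tokens):
        self.doc_lengths[docid] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][docid] = tf

    def document_frequency(self, term):
        return len(self.postings.get(term, {}))
```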
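For reference, the standard BM25 contribution of a single query term to one document, with the usual free parameters `k1` and `b`; `custom_ranker.py` may use a slightly different variant or parameter values.

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25: idf weight times a saturated, length-normalized tf weight."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
    tf_weight = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_weight
```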
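The author/year re-ranking could look roughly like the following: boost candidates whose metadata matches entities found in the query. This is a sketch under assumed data shapes (sets of author names, integer years), not the code in `custom_ranker.py`.

```python
def rerank(results, query_authors, query_year,
           docid_authors_map, docid_year_map, boost=0.25):
    """Multiplicatively boost scores of docs whose authors/year match the query."""
    boosted = []
    for docid, score in results:
        if query_authors & set(docid_authors_map.get(docid, [])):
            score *= 1 + boost
        if query_year is not None and docid_year_map.get(docid) == query_year:
            score *= 1 + boost
        boosted.append((docid, score))
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)
```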
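And the two evaluation metrics, in their textbook form (MAP is the mean of average precision over a set of queries); the project's versions live in `relevance.py`.

```python
import math

def average_precision(ranked_docids, relevant):
    """Precision averaged at each rank where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def ndcg(ranked_scores, k=10):
    """NDCG@k from graded relevance scores listed in ranked order."""
    dcg = sum(s / math.log2(i + 2) for i, s in enumerate(ranked_scores[:k]))
    ideal = sum(s / math.log2(i + 2)
                for i, s in enumerate(sorted(ranked_scores, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0
```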
## Project Structure

```
IR_Project/
├── all_papers_index/
│   ├── index.json               # Main paper text index
├── all_papers_title_index/
│   ├── index.json               # Title index
├── all_authors.pkl              # List of all authors
├── docid_authors_map.pkl        # Mapping of document IDs to authors
├── docid_link_map.pkl           # Mapping of document IDs to URLs
├── docid_title_map.pkl          # Mapping of document IDs to titles
├── docid_abstract_map.pkl       # Mapping of document IDs to abstracts
├── docid_year_map.pkl           # Mapping of document IDs to publication years
├── stopwords.txt                # List of stopwords for filtering
├── PapeRet.py                   # Streamlit app
├── main.py                      # System initialization
├── indexing.py                  # Indexing strategies
├── document_preprocessor.py     # Tokenization and preprocessing
├── custom_ranker.py             # Ranking and scoring algorithms
├── relevance.py                 # Relevance evaluation and testing
├── llama_tokenise_rag.py        # LLAMA integration for summarization and query extraction
├── scratch_and_other_prep.ipynb # Rough notebook used to create miscellaneous files and tests
├── ArXiv/                       # Scripts for obtaining the paper corpus and text from arXiv
├── OpenAlex/                    # Scripts for fetching metadata of seed papers and their references, and scraping text from 15 websites
└── README.md                    # Project documentation
```
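
The mapping files are ordinary Python pickles and the indexes are JSON, so loading them is straightforward. A minimal sketch, assuming each `.pkl` deserializes to a dict keyed by document ID as the comments above suggest:

```python
import json
import pickle

with open("docid_title_map.pkl", "rb") as f:
    docid_title_map = pickle.load(f)    # docid -> title

with open("docid_year_map.pkl", "rb") as f:
    docid_year_map = pickle.load(f)     # docid -> publication year

with open("all_papers_index/index.json") as f:
    main_index = json.load(f)           # serialized main paper text index
```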
## Installation

- Clone the repository:

  ```
  git clone https://github.com/yourusername/IR_Project.git
  cd IR_Project
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Set up the necessary data files in the project directory:
  - Place the `.pkl` and index files for the mappings.
  - Add `stopwords.txt` for stopword filtering.
  - Download the files from this link.

- Add your login key for huggingface_hub in the `setup_pipeline` function in `llama_tokenise_rag.py`.
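
For the last step, `huggingface_hub` provides a `login()` helper; the token hookup inside `setup_pipeline` might look roughly like this (the token is yours to supply, and the rest of the function body is elided):

```python
from huggingface_hub import login

def setup_pipeline():
    # Authenticate so gated LLAMA weights can be downloaded.
    login(token="hf_your_token_here")  # replace with your own access token
    ...
```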
## Usage

To start the application, run the following command:

```
streamlit run PapeRet.py
```

The app interface allows you to:
- Input search queries and retrieve results.
- Toggle LLAMA integration for enhanced query processing.
- View and expand AI-generated summaries tailored to your query.
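
As a rough illustration of what the LLAMA toggle does under the hood, here is a sketch of query-to-keyword extraction via a `transformers` text-generation pipeline. The model name and prompt are placeholders; the project's actual calls live in `llama_tokenise_rag.py`.

```python
from transformers import pipeline

# Placeholder model; the real model is configured in llama_tokenise_rag.py.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

def extract_keywords(query: str) -> str:
    prompt = f"Extract the key search terms from this query: {query}\nKeywords:"
    output = generator(prompt, max_new_tokens=32)[0]["generated_text"]
    return output[len(prompt):].strip()  # keep only the newly generated text
```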
## Results

The system achieves high precision and recall across multiple datasets. Evaluation results include:
- Mean Average Precision (MAP) and NDCG metrics.
## Authors

- Nilay Gautam
- Rishikesh Ksheersagar
## License

This project is licensed under the MIT License. See the LICENSE file for details.