zjzhao1002/arXivFlow

arXivFlow 🚀

License: MIT | Python 3.13+ | Ollama | Gemini | arXiv

arXivFlow is a Python automation tool for discovering and tracking research papers. It fetches metadata from arXiv, runs AI-driven analysis with Ollama or the Gemini API, and synchronizes the results with Google Sheets and local databases.


✨ Features

  • Asynchronous API: Fully rewritten with asyncio for high-performance paper retrieval and PDF processing.
  • Automated Retrieval: Fetch the latest papers from specific arXiv categories (e.g., cs.AI, cs.LG, hep-ph) within any date range.
  • AI Analysis Options: Uses Ollama models for local/private extraction or Gemini models for cloud-backed extraction of keywords and contact information (emails/affiliations).
  • Intelligent PDF Handling: Automatically downloads PDFs and extracts text for deep analysis. Supports custom storage paths and atomic PDF writes.
  • Robust arXiv Requests: Built-in compliance with arXiv's API guidelines (3-second request intervals), paged metadata retrieval, 429 cooldown handling, retry backoff, and duplicate-result cleanup.
  • Multi-Format Export: Save your research data to CSV, JSON, Excel, or SQLite for flexible offline analysis.
  • Google Sheets Sync: Seamlessly push compiled research data to a shared Google Sheet for team collaboration.
  • Type-Safe & Modular: Clean, documented Python code with full type hinting and a class-based architecture.
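
The 3-second request interval mentioned in the list above can be enforced in asyncio code with a small limiter. The following is a minimal sketch and not arXivFlow's actual implementation: `MinIntervalLimiter` and its `wait` method are hypothetical names chosen for illustration. It serializes callers with a lock and sleeps until the interval since the previous request has elapsed.

```python
import asyncio
import time
from typing import Optional


class MinIntervalLimiter:
    """Hypothetical helper: serialize callers and space them out in time."""

    def __init__(self, interval: float = 3.0) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        # Monotonic timestamp of the previous request, None before the first one.
        self._last: Optional[float] = None

    async def wait(self) -> None:
        # Holding the lock while sleeping keeps concurrent tasks in a single queue.
        async with self._lock:
            if self._last is not None:
                remaining = self.interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()
```

A caller would `await limiter.wait()` immediately before each arXiv request, so that tasks started together with `asyncio.gather` still reach the API one at a time.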

🛠️ Prerequisites

  1. Python 3.13+: Ensure you have a modern Python environment.
  2. Choose an AI backend:
    • For Ollama, install Ollama and download the required model (e.g., Llama 3.2):
      ollama pull llama3.2
    • For Gemini, create a Gemini API key and either pass it as gemini_api_key or set it as GOOGLE_AI_API.
  3. Google Cloud Credentials for Google Sheets sync:
    • Enable the Google Sheets and Google Drive APIs.
    • Create a Service Account and download the JSON key as credentials.json.
    • Share the target sheet with the service account's email address and give it 'Editor' permissions.

🚀 Installation

From PyPI (Recommended)

pip install arxivflow

From Source (For Development)

  1. Clone the repository:

    git clone https://github.com/zjzhao1002/arXivFlow.git
    cd arXivFlow
  2. Set up virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:

    pip install -e .

📖 Usage

Quick Start (Async)

import asyncio
import datetime
from arxivflow import arXivFlow

async def main():
    # 1. Initialize the flow with Ollama
    flow = arXivFlow(
        categories=["cs.AI", "cs.CV"], 
        ollama_model="llama3.2",
        max_results=20,
        start_date=datetime.datetime.now() - datetime.timedelta(days=7),
        request_timeout=60.0
    )

    # 2. Fetch data & Extract info (Keywords/Contacts)
    df = await flow.get_arxiv_data(download_pdfs=True)

    # 3. Save to your preferred formats
    flow.save_to_csv("my_research.csv")
    flow.save_to_sqlite("research.db")

    # 4. Sync with Google Sheets
    flow.save_to_google_sheet(
        sheet_id="YOUR_SHEET_ID", 
        credentials_file="credentials.json"
    )
    
    # 5. Close the client
    await flow.close()

if __name__ == "__main__":
    asyncio.run(main())

Gemini Backend

import asyncio
import datetime
import os
from arxivflow import arXivFlow

async def main():
    flow = arXivFlow(
        categories=["cs.AI", "cs.CV"],
        gemini_model="gemini-2.5-flash",
        gemini_api_key=os.getenv("GOOGLE_AI_API"),
        max_results=20,
        start_date=datetime.datetime.now() - datetime.timedelta(days=7),
    )

    df = await flow.get_arxiv_data(download_pdfs=True)
    flow.save_to_csv("my_research.csv")
    await flow.close()

if __name__ == "__main__":
    asyncio.run(main())

If both ollama_model and gemini_model are provided, Ollama takes precedence. When gemini_model is set, a Gemini API key is required; pass gemini_api_key directly or set the GOOGLE_AI_API environment variable.
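
The precedence rule above can be pictured with a short sketch. This is an illustration of the documented behavior, not code from arXivFlow: `choose_backend` is a hypothetical function, and returning `None` when no model is configured is an assumption, since the README does not say what happens in that case.

```python
import os
from typing import Optional, Tuple


def choose_backend(
    ollama_model: Optional[str] = None,
    gemini_model: Optional[str] = None,
    gemini_api_key: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper returning (backend, model) per the documented rules."""
    # Ollama wins whenever it is configured, even if a Gemini model is also given.
    if ollama_model:
        return ("ollama", ollama_model)
    if gemini_model:
        # An explicit key takes priority; otherwise fall back to the environment.
        key = gemini_api_key or os.getenv("GOOGLE_AI_API")
        if not key:
            raise ValueError(
                "gemini_model is set but no API key was found; "
                "pass gemini_api_key or set GOOGLE_AI_API."
            )
        return ("gemini", gemini_model)
    # Assumption: no backend configured means no AI analysis is selected.
    return None
```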


🧱 Request Stability

arXiv can occasionally return slow responses, rate limits, or temporary service errors. arXivFlow now makes the request path more stable by:

  • Fetching arXiv metadata in smaller pages instead of relying on one large request.
  • Fetching metadata for all requested categories before starting PDF downloads, which avoids PDF download bursts interfering with the next category query.
  • Serializing arXiv requests and preserving the recommended 3-second interval.
  • Retrying transient failures (429, 500, 502, 503, 504, timeouts, and network errors) with exponential backoff and jitter.
  • Applying a longer cooldown after 429 rate-limit responses before making the next arXiv request.
  • Respecting Retry-After headers when arXiv provides them.
  • Using a default 60-second request timeout, configurable with request_timeout.
  • Writing PDFs to temporary .part files first, then atomically replacing the final file only after validating PDF-like content.
  • Deduplicating merged output by arXiv ID.
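
To make the retry behavior in the list above concrete, here is a minimal sketch of how a wait time could be derived from the attempt number, the HTTP status, and an optional Retry-After header. It is not arXivFlow's implementation: the function names and every numeric default (`base`, `cap`, `cooldown_429`) are assumptions chosen for illustration.

```python
import random
from typing import Callable, Optional

# Status codes the README lists as transient.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def parse_retry_after(value: Optional[str]) -> Optional[float]:
    """Parse a Retry-After header given in seconds; None if absent or unparseable."""
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        # The HTTP-date form of Retry-After is not handled in this sketch.
        return None


def retry_delay(
    attempt: int,
    status: int,
    retry_after: Optional[float] = None,
    base: float = 3.0,           # assumed base delay in seconds
    cap: float = 120.0,          # assumed upper bound on the backoff
    cooldown_429: float = 30.0,  # assumed minimum wait after a 429
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A server-provided Retry-After takes priority over the computed backoff.
    if retry_after is not None:
        return retry_after
    # Exponential backoff, capped, plus up to `base` seconds of random jitter.
    delay = min(cap, base * 2 ** attempt) + rng() * base
    # Rate-limit responses get a longer cooldown than other transient errors.
    if status == 429:
        delay = max(delay, cooldown_429)
    return delay
```

The jitter spreads out retries from clients that failed at the same moment, so they do not all hit the server again in the same instant.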

For especially large date ranges, prefer smaller max_results values or narrower date windows. arXivFlow pages requests internally, but smaller slices put less load on arXiv and are more reliable in practice.
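
One way to follow this advice is to split a long range into fixed-size windows and run one fetch per window. The helper below is a hypothetical sketch, not part of arXivFlow; the 7-day default is an arbitrary choice.

```python
import datetime
from typing import Iterator, Tuple


def date_windows(
    start: datetime.datetime,
    end: datetime.datetime,
    days: int = 7,
) -> Iterator[Tuple[datetime.datetime, datetime.datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if days <= 0:
        raise ValueError("days must be positive")
    step = datetime.timedelta(days=days)
    cursor = start
    while cursor < end:
        # The final window is shortened so it never runs past `end`.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end
```

Each window's start can be passed as start_date when constructing a flow. This README does not show the name of an end-date parameter, so check the constructor signature before relying on one.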


🏗️ Architecture

The project follows a modular structure for easy extension:

  • src/arxivflow/arxivflow.py: The main orchestrator class (arXivFlow).
  • src/arxivflow/ollama_functions.py: Local LLM interface using the Ollama API.
  • src/arxivflow/gemini_functions.py: Gemini API interface for cloud-backed keyword and contact extraction.
  • src/arxivflow/arxiv_functions.py: Asynchronous arXiv API interaction layer, including paging, rate limiting, retries, and PDF downloads.
  • src/arxivflow/categories.py: arXiv category definitions.
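
As an illustration of the PDF download step handled by the arXiv interaction layer, the sketch below writes a payload to a temporary .part file, checks that it looks like a PDF, and only then replaces the final file. The names `PDF_MAGIC` and `write_pdf_atomically` are hypothetical and do not come from arxiv_functions.py, and checking the `%PDF-` header is an assumed stand-in for whatever validation the project actually performs.

```python
import os
from pathlib import Path

# Every well-formed PDF file begins with these bytes.
PDF_MAGIC = b"%PDF-"


def write_pdf_atomically(data: bytes, target: Path) -> bool:
    """Write `data` to `target` via a .part file; return False if it is not PDF-like."""
    target.parent.mkdir(parents=True, exist_ok=True)
    part = target.with_name(target.name + ".part")
    part.write_bytes(data)
    # Validate the temporary file before it can replace anything.
    with part.open("rb") as fh:
        header = fh.read(len(PDF_MAGIC))
    if header != PDF_MAGIC:
        # E.g. an HTML error page: discard it and keep any existing file intact.
        part.unlink()
        return False
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(part, target)
    return True
```

Because the final path only ever changes through `os.replace`, an interrupted download leaves behind at most a stray .part file rather than a truncated PDF.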

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
