arXivFlow is a powerful Python-based automation tool designed to streamline the research paper discovery and tracking process. It autonomously fetches metadata from arXiv, performs AI-driven analysis using Ollama or the Gemini API, and synchronizes the results with Google Sheets and local databases.
- Asynchronous API: Fully rewritten with `asyncio` for high-performance paper retrieval and PDF processing.
- Automated Retrieval: Fetch the latest papers from specific arXiv categories (e.g., `cs.AI`, `cs.LG`, `hep-ph`) within any date range.
- AI Analysis Options: Uses Ollama models for local/private extraction or Gemini models for cloud-backed extraction of keywords and contact information (emails/affiliations).
- Intelligent PDF Handling: Automatically downloads PDFs and extracts text for deep analysis. Supports custom storage paths and atomic PDF writes.
- Robust arXiv Requests: Built-in compliance with arXiv's API guidelines (3-second request intervals), paged metadata retrieval, 429 cooldown handling, retry backoff, and duplicate-result cleanup.
- Multi-Format Export: Save your research data to CSV, JSON, Excel, or SQLite for flexible offline analysis.
- Google Sheets Sync: Seamlessly push compiled research data to a shared Google Sheet for team collaboration.
- Type-Safe & Modular: Clean, documented Python code with full type hinting and a class-based architecture.
- Python 3.13+: Ensure you have a modern Python environment.
- Choose an AI backend:
  - For Ollama, install Ollama and download the required model (e.g., Llama 3.2):

        ollama pull llama3.2

  - For Gemini, create a Gemini API key and either pass it as `gemini_api_key` or set it as `GOOGLE_AI_API`.
- Google Cloud Credentials for Google Sheets sync:
  - Enable the Google Sheets and Google Drive APIs.
  - Create a Service Account and download the JSON key as `credentials.json`.
  - Ensure the service account has 'Editor' permissions on the sheet.
Install with pip:

    pip install arxivflow

Alternatively, install from source:

- Clone the repository:

      git clone https://github.com/zjzhao/arXivFlow.git
      cd arXivFlow

- Set up a virtual environment:

      python -m venv .
      source bin/activate  # On Windows: Scripts\activate

- Install dependencies:

      pip install -e .
Basic usage with a local Ollama model:

    import asyncio
    import datetime

    from arxivflow import arXivFlow


    async def main():
        # 1. Initialize the flow with Ollama
        flow = arXivFlow(
            categories=["cs.AI", "cs.CV"],
            ollama_model="llama3.2",
            max_results=20,
            start_date=datetime.datetime.now() - datetime.timedelta(days=7),
            request_timeout=60.0,
        )

        # 2. Fetch data & Extract info (Keywords/Contacts)
        df = await flow.get_arxiv_data(download_pdfs=True)

        # 3. Save to your preferred formats
        flow.save_to_csv("my_research.csv")
        flow.save_to_sqlite("research.db")

        # 4. Sync with Google Sheets
        flow.save_to_google_sheet(
            sheet_id="YOUR_SHEET_ID",
            credentials_file="credentials.json",
        )

        # 5. Close the client
        await flow.close()


    if __name__ == "__main__":
        asyncio.run(main())

To use Gemini instead of Ollama:

    import asyncio
    import datetime
    import os

    from arxivflow import arXivFlow


    async def main():
        flow = arXivFlow(
            categories=["cs.AI", "cs.CV"],
            gemini_model="gemini-2.5-flash",
            gemini_api_key=os.getenv("GOOGLE_AI_API"),
            max_results=20,
            start_date=datetime.datetime.now() - datetime.timedelta(days=7),
        )
        df = await flow.get_arxiv_data(download_pdfs=True)
        flow.save_to_csv("my_research.csv")
        await flow.close()


    if __name__ == "__main__":
        asyncio.run(main())

If both `ollama_model` and `gemini_model` are provided, Ollama takes precedence. When `gemini_model` is set, a Gemini API key is required; pass `gemini_api_key` directly or set the `GOOGLE_AI_API` environment variable.
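The backend precedence described above can be sketched as a small standalone function. This is only an illustration: `resolve_backend` is a hypothetical helper, not part of arXivFlow's API, and two details are assumptions rather than documented behavior: that an explicit `gemini_api_key` wins over the environment variable, and that configuring neither model raises an error.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_backend(
    ollama_model: Optional[str] = None,
    gemini_model: Optional[str] = None,
    gemini_api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str, Optional[str]]:
    """Return (backend, model, api_key) following the documented precedence.

    Hypothetical helper for illustration; not part of arXivFlow's API.
    """
    env = os.environ if env is None else env
    if ollama_model:
        # Ollama takes precedence whenever it is configured.
        return ("ollama", ollama_model, None)
    if gemini_model:
        # Assumption: an explicit argument wins over GOOGLE_AI_API.
        key = gemini_api_key or env.get("GOOGLE_AI_API")
        if not key:
            raise ValueError("gemini_model is set but no Gemini API key was found")
        return ("gemini", gemini_model, key)
    # Assumption: at least one backend must be configured.
    raise ValueError("set either ollama_model or gemini_model")
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.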
arXiv can occasionally return slow responses, rate limits, or temporary service errors. arXivFlow now makes the request path more stable by:
- Fetching arXiv metadata in smaller pages instead of relying on one large request.
- Fetching metadata for all requested categories before starting PDF downloads, which avoids PDF download bursts interfering with the next category query.
- Serializing arXiv requests and preserving the recommended 3-second interval.
- Retrying transient failures (`429`, `500`, `502`, `503`, `504`, timeouts, and network errors) with exponential backoff and jitter.
- Applying a longer cooldown after `429` rate-limit responses before making the next arXiv request.
- Respecting `Retry-After` headers when arXiv provides them.
- Using a default 60-second request timeout, configurable with `request_timeout`.
- Writing PDFs to temporary `.part` files first, then atomically replacing the final file only after validating PDF-like content.
- Deduplicating merged output by arXiv ID.
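Two of the measures above, the retry delay and the atomic PDF write, can be sketched with the standard library alone. This is not arXivFlow's actual code: the function names, the base delay, the cap, the extra `429` cooldown, and the `%PDF-` header check are all assumptions chosen for illustration.

```python
import os
import random
from pathlib import Path
from typing import Optional

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # status codes listed above


def is_retryable(status: int) -> bool:
    """True for the transient HTTP status codes that warrant a retry."""
    return status in RETRYABLE_STATUS


def retry_delay(
    attempt: int,
    status: Optional[int] = None,
    retry_after: Optional[str] = None,
    base: float = 3.0,           # assumed base delay in seconds
    cap: float = 120.0,          # assumed upper bound for the backoff
    cooldown_429: float = 30.0,  # assumed extra cooldown after a 429
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A numeric Retry-After header overrides the computed delay.
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].
    delay = random.uniform(0.0, min(cap, base * (2 ** attempt)))
    if status == 429:
        delay += cooldown_429
    return delay


def write_pdf_atomically(data: bytes, target: Path) -> bool:
    """Write `data` via a .part file; return False if it does not look like a PDF."""
    part = target.with_name(target.name + ".part")
    part.write_bytes(data)
    # Assumed "PDF-like" check: the file starts with the %PDF- magic bytes.
    with part.open("rb") as fh:
        looks_like_pdf = fh.read(5) == b"%PDF-"
    if not looks_like_pdf:
        part.unlink()
        return False
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(part, target)
    return True
```

Because the final path only ever appears through `os.replace`, an interrupted download leaves at most a stray `.part` file, never a truncated PDF under the real name.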
For especially large date ranges, prefer smaller `max_results` values or narrower date windows. arXivFlow pages requests internally, but smaller slices are still gentler on arXiv and more reliable in practice.
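One way to narrow the date windows is to split a long range into consecutive slices and run one fetch per slice. The sketch below uses a hypothetical `date_windows` helper and only produces the slices; how each slice is passed to arXivFlow is left open, since the examples above show only `start_date` and an end-date parameter would be an assumption.

```python
import datetime
from typing import Iterator, Tuple


def date_windows(
    start: datetime.datetime,
    end: datetime.datetime,
    days: int = 7,
) -> Iterator[Tuple[datetime.datetime, datetime.datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if days <= 0:
        raise ValueError("days must be positive")
    step = datetime.timedelta(days=days)
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end
```

For example, a range from 1 January to 20 January with `days=7` yields three windows, the last one shorter than the others.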
The project follows a modular structure for easy extension:
- `src/arxivflow/arxivflow.py`: The main orchestrator class (`arXivFlow`).
- `src/arxivflow/ollama_functions.py`: Local LLM interface using the Ollama API.
- `src/arxivflow/gemini_functions.py`: Gemini API interface for cloud-backed keyword and contact extraction.
- `src/arxivflow/arxiv_functions.py`: Asynchronous arXiv API interaction layer, including paging, rate limiting, retries, and PDF downloads.
- `src/arxivflow/categories.py`: arXiv category definitions.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request