This script helps compare an institution's publications in OpenAlex and Pure by identifying publications missing from Pure based on their DOIs.

svidmar/OpenAlex_Pure_Pubfinder

OpenAlex-Pure Publication Finder

A Python script to identify publications in OpenAlex that are missing from your Elsevier Pure research information system.

Overview

This tool compares publications indexed in OpenAlex for your institution(s) against those in your Pure system, generating an Excel report of potentially missing publications that may need to be added to Pure.

Features

  • ✅ Fetches publications from Pure API with robust error handling
  • ✅ Queries OpenAlex by ROR ID(s) and publication year
  • ✅ Supports multiple ROR IDs for institutions with multiple identifiers
  • ✅ Filters OpenAlex results to show only authors from your institution(s)
  • ✅ Generates detailed Excel report with:
    • DOI, title, authors, affiliations, ORCID
    • Publication year and date
    • Open access status and Unpaywall links (if found)
    • PDF URLs where available
  • ✅ Progress saving and resume capability
  • ✅ Comprehensive logging for debugging

Recent Improvements

Version 2.0 includes stability and reliability enhancements:

  • Retry Logic: Automatically retries failed API requests with exponential backoff
  • Progress Saving: Saves progress every 10 pages - resume from where you left off if interrupted
  • Better Error Handling: Gracefully handles Pure API inconsistencies and JSON parsing errors
  • Multiple ROR Support: Now supports institutions with multiple ROR identifiers
  • Improved Validation: Detects incomplete fetches and warns about suspicious results
  • Enhanced Logging: Detailed timestamped logs for easier debugging and monitoring
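
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the name `retry_with_backoff`, the attempt count, and the base delay are all assumptions made for the example.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again.

    Hypothetical helper for illustration; the real script may differ.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # After the final attempt, give up and surface the error.
            if attempt == max_attempts - 1:
                raise
            # Delays grow as 1s, 2s, 4s, ... with the default base_delay.
            sleep(base_delay * (2 ** attempt))
```

In practice `func` would wrap a single API request, and catching a narrower exception type than `Exception` would usually be preferable.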

Requirements

  • Python 3.7+
  • Pure API access (API key required)
  • ROR ID(s) for your institution(s)

Installation

  1. Clone this repository:
git clone https://github.com/yourusername/OpenAlex-Pure-Pubfinder.git
cd OpenAlex-Pure-Pubfinder
  2. Install required packages:
pip install -r requirements.txt
  3. Create your configuration file:
cp .env.template .env
  4. Edit .env with your settings:
# Pure API Configuration
PURE_API_URL=https://your-institution.elsevierpure.com/ws/api/524/research-outputs
PURE_API_KEY=your_api_key_here
PURE_PUBLISHED_AFTER=2023-12-31T00:00:00.000Z

# OpenAlex Configuration (ROR IDs can be comma-separated for multiple institutions)
ROR_ID=https://ror.org/your_ror_id
FROM_YEAR=2024
TO_YEAR=2024

# Output Configuration
OUTPUT_FILE=/path/to/output/missing_publications.xlsx

Usage

Basic Usage

python pubfinder.py

With Environment Wrapper (Recommended)

python run_pubfinder.py

The script will:

  1. Fetch all publications from Pure matching your date filter
  2. Query OpenAlex for publications from your institution(s)
  3. Compare DOIs to identify missing publications
  4. Generate an Excel report with details

For Automated/Cron Jobs

Set AUTO_RESUME=true in your .env file to skip the resume prompt:

# In .env
AUTO_RESUME=true

Then add to crontab:

# Run daily at 2 AM
0 2 * * * cd /path/to/OpenAlex-Pure-Pubfinder && /usr/bin/python3 run_pubfinder.py >> logs/cron.log 2>&1
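
For unattended runs, a flag like AUTO_RESUME has to be read as a boolean rather than a raw string. The sketch below shows one way that could be done; the helper name `env_flag` and the set of accepted "true" spellings are assumptions, not something taken from the script.

```python
import os


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

With this kind of helper, a cron run could check `env_flag("AUTO_RESUME")` and skip the interactive prompt when it returns True.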

Configuration Details

ROR IDs

Your institution's ROR (Research Organization Registry) ID can be found at ror.org.

  • Single institution: ROR_ID=https://ror.org/04m5j1k67
  • Multiple institutions: ROR_ID=https://ror.org/04m5j1k67,https://ror.org/02jk5qe80
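
A comma-separated ROR_ID value can be split into individual identifiers along these lines. This is a sketch only; `parse_ror_ids` is a hypothetical name and the script may parse the value differently.

```python
def parse_ror_ids(raw):
    """Split a comma-separated ROR_ID value into a list of clean URLs."""
    # Strip whitespace around each entry and drop empty entries,
    # so a trailing comma or stray space does not produce a bogus ID.
    return [part.strip() for part in raw.split(",") if part.strip()]
```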

Date Filtering

The PURE_PUBLISHED_AFTER parameter filters Pure publications. Use ISO 8601 format:

PURE_PUBLISHED_AFTER=2023-12-31T00:00:00.000Z
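
To catch a malformed timestamp before any API call is made, the value can be checked up front. The sketch below assumes the exact format shown above (milliseconds and a trailing Z); `parse_pure_timestamp` is a hypothetical helper, not part of the script.

```python
from datetime import datetime, timezone


def parse_pure_timestamp(value):
    """Parse a value such as 2023-12-31T00:00:00.000Z into an aware UTC datetime.

    Raises ValueError if the value does not match the expected format.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing Z means UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)
```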

Year Range

FROM_YEAR and TO_YEAR control the OpenAlex query:

FROM_YEAR=2024
TO_YEAR=2024
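
As an illustration of how the ROR IDs and year range might combine into a single OpenAlex filter, consider the sketch below. The function name is hypothetical. The filter keys (`institutions.ror`, `from_publication_date`, `to_publication_date`) and the `|` OR separator reflect my understanding of the OpenAlex works API, and the script itself may build its query differently.

```python
def build_openalex_filter(ror_ids, from_year, to_year):
    """Build an OpenAlex works filter string (illustrative sketch only)."""
    # Assumption: multiple values for one filter are OR-ed with "|".
    institutions = "|".join(ror_ids)
    return (
        f"institutions.ror:{institutions},"
        f"from_publication_date:{from_year}-01-01,"
        f"to_publication_date:{to_year}-12-31"
    )
```

The resulting string would be passed as the `filter` query parameter of a works request.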

Output

The script generates an Excel file with the following columns:

Column                        | Description
DOI                           | Digital Object Identifier
Title                         | Publication title
Authors (My Institution)      | Authors affiliated with your institution(s)
Affiliations (My Institution) | Raw affiliation strings for your authors
ORCID (My Institution)        | ORCID identifiers for your authors
Publication Year              | Year of publication
Publication Date              | Full publication date
Is OA                         | Whether the publication is Open Access
OA Status                     | Open access status (gold, green, hybrid, etc.)
OA URL                        | URL to open access version
Accepted                      | Whether an accepted manuscript version exists
Published                     | Whether a published version exists
License                       | Publication license
PDF URL                       | Direct link to PDF (if available)
Type                          | Publication type
Source                        | Journal/conference name
Link                          | Clickable DOI link

Troubleshooting

Script Stops Unexpectedly

The script saves progress every 10 pages. If interrupted, simply run it again and choose "yes" when asked to resume.
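
The general idea behind saving and resuming progress can be sketched as a small JSON checkpoint. The file layout, field names, and function names here are assumptions for illustration; the script's own checkpoint format is not documented in this README.

```python
import json
import os


def save_progress(path, next_page, dois):
    """Write the checkpoint via a temp file so an interrupted run
    never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"next_page": next_page, "dois": sorted(dois)}, handle)
    os.replace(tmp_path, path)


def load_progress(path):
    """Return (next_page, dois); start from page 1 when no checkpoint exists."""
    if not os.path.exists(path):
        return 1, set()
    with open(path, encoding="utf-8") as handle:
        state = json.load(handle)
    return state["next_page"], set(state["dois"])
```

A fetch loop would call `save_progress` after every tenth page and `load_progress` once at startup.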

No Authors Showing in Output

Make sure your ROR IDs in .env are correct and match the format used by OpenAlex (full URLs like https://ror.org/...).
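
The sketch below shows why the format matters: matching is done on the exact ROR string, so a bare ID never equals the full URL. The field names (`authorships`, `institutions`, `ror`, `author`, `display_name`) follow my understanding of the OpenAlex work object; the function itself is hypothetical and not taken from the script.

```python
def institution_authors(work, ror_ids):
    """Return display names of authors whose institutions match any given ROR ID."""
    wanted = set(ror_ids)
    names = []
    for authorship in work.get("authorships", []):
        # Collect every ROR attached to this author's institutions.
        rors = {inst.get("ror") for inst in authorship.get("institutions", [])}
        if rors & wanted:
            names.append(authorship.get("author", {}).get("display_name"))
    return names
```

Passing `["04m5j1k67"]` instead of `["https://ror.org/04m5j1k67"]` to a function like this returns an empty list, which would show up as empty author columns in the report.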

Pure API Errors

If you encounter Pure API errors:

  1. Check your API key is valid
  2. Verify the API URL is correct
  3. As a Pure admin, try running the ContentCorrectionJob in Pure (Admin → System → Jobs) to fix data integrity issues

Empty Results

Check that:

  • Your date filters are correct
  • Publications exist in both systems for the specified criteria
  • Your ROR ID matches publications in OpenAlex

Logging

All runs are logged to timestamped files in the logs/ directory:

logs/pubfinder_YYYYMMDD_HHMMSS.log

Check these logs for detailed information about each run.
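
A timestamped log file following the naming pattern above could be set up roughly like this. It is a sketch using the standard `logging` module; the function names and the log format string are assumptions.

```python
import logging
import os
from datetime import datetime


def log_path(now, log_dir="logs"):
    """Build a path such as logs/pubfinder_YYYYMMDD_HHMMSS.log."""
    return f"{log_dir}/pubfinder_{now:%Y%m%d_%H%M%S}.log"


def setup_logging(log_dir="logs"):
    """Create the log directory and send log records to a new timestamped file."""
    os.makedirs(log_dir, exist_ok=True)
    path = log_path(datetime.now(), log_dir)
    logging.basicConfig(
        filename=path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return path
```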

How It Works

  1. Pure Fetch: Retrieves all research outputs from Pure matching the date filter
  2. DOI Extraction: Extracts and normalizes DOIs from Pure publications
  3. OpenAlex Query: Fetches publications from OpenAlex filtered by:
    • Institution ROR ID(s)
    • Publication year range
  4. Author Filtering: Identifies which authors on each OpenAlex publication are from your institution(s)
  5. Comparison: Matches DOIs between systems
  6. Report Generation: Creates Excel file with publications found in OpenAlex but missing in Pure
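
The DOI normalization and comparison steps can be illustrated with a short sketch. The prefixes stripped and the function names are assumptions; the script's own normalization rules may be stricter or looser.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a resolver or 'doi:' prefix if present."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def find_missing(openalex_dois, pure_dois):
    """Return normalized DOIs present in OpenAlex but absent from Pure."""
    known = {normalize_doi(d) for d in pure_dois if d}
    candidates = {normalize_doi(d) for d in openalex_dois if d}
    # Records without a DOI are skipped, since they cannot be matched.
    return sorted(candidates - known)
```

Normalizing both sides before the set difference avoids reporting a publication as missing only because one system stores the DOI as a full URL or in a different case.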

Notes

  • The script uses DOI matching, so publications without DOIs won't be matched
  • The script may produce false positives when a publication's year differs between OpenAlex and Pure, so querying a range of years rather than a single year is recommended.

Support

For issues or questions:

  1. Check the logs in the logs/ directory
  2. Review the troubleshooting section above
  3. Open an issue on GitHub

License

See LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments
