A Python script to identify publications in OpenAlex that are missing from your Elsevier Pure research information system.
This tool compares publications indexed in OpenAlex for your institution(s) against those in your Pure system, generating an Excel report of potentially missing publications that may need to be added to Pure.
- ✅ Fetches publications from Pure API with robust error handling
- ✅ Queries OpenAlex by ROR ID(s) and publication year
- ✅ Supports multiple ROR IDs for institutions with multiple identifiers
- ✅ Filters OpenAlex results to show only authors from your institution(s)
- ✅ Generates detailed Excel report with:
  - DOI, title, authors, affiliations, ORCID
  - Publication year and date
  - Open access status and Unpaywall links (if found)
  - PDF URLs where available
- ✅ Progress saving and resume capability
- ✅ Comprehensive logging for debugging
Version 2.0 includes stability and reliability enhancements:
- Retry Logic: Automatically retries failed API requests with exponential backoff
- Progress Saving: Saves progress every 10 pages - resume from where you left off if interrupted
- Better Error Handling: Gracefully handles Pure API inconsistencies and JSON parsing errors
- Multiple ROR Support: Now supports institutions with multiple ROR identifiers
- Improved Validation: Detects incomplete fetches and warns about suspicious results
- Enhanced Logging: Detailed timestamped logs for easier debugging and monitoring
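To illustrate the retry behaviour described above, here is a minimal sketch of exponential backoff. It is not the script's actual code: the function name `retry_with_backoff`, its parameters, and the default of four attempts are assumptions made for illustration.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait each time.

    Hypothetical helper: the name, defaults and exception choice are
    assumptions, not taken from the Pubfinder source.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Waits base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping each API request in a helper like this keeps the retry policy in one place. The `sleep` parameter exists only so the delay can be replaced in tests.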
- Python 3.7+
- Pure API access (API key required)
- ROR ID(s) for your institution(s)
- Clone this repository:
git clone https://github.com/yourusername/OpenAlex-Pure-Pubfinder.git
cd OpenAlex-Pure-Pubfinder
- Install required packages:
pip install -r requirements.txt
- Create your configuration file:
cp .env.template .env
- Edit .env with your settings:
# Pure API Configuration
PURE_API_URL=https://your-institution.elsevierpure.com/ws/api/524/research-outputs
PURE_API_KEY=your_api_key_here
PURE_PUBLISHED_AFTER=2023-12-31T00:00:00.000Z
# OpenAlex Configuration (ROR IDs can be comma-separated for multiple institutions)
ROR_ID=https://ror.org/your_ror_id
FROM_YEAR=2024
TO_YEAR=2024
# Output Configuration
OUTPUT_FILE=/path/to/output/missing_publications.xlsx
python pubfinder.py
python run_pubfinder.py
The script will:
- Fetch all publications from Pure matching your date filter
- Query OpenAlex for publications from your institution(s)
- Compare DOIs to identify missing publications
- Generate an Excel report with details
Set AUTO_RESUME=true in your .env file to skip the resume prompt:
# In .env
AUTO_RESUME=true
Then add to crontab:
# Run daily at 2 AM
0 2 * * * cd /path/to/OpenAlex-Pure-Pubfinder && /usr/bin/python3 run_pubfinder.py >> logs/cron.log 2>&1
Your institution's ROR (Research Organization Registry) ID can be found at ror.org.
- Single institution:
ROR_ID=https://ror.org/04m5j1k67
- Multiple institutions:
ROR_ID=https://ror.org/04m5j1k67,https://ror.org/02jk5qe80
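A comma-separated ROR_ID value has to be split into individual identifiers before it can be used. The sketch below shows one way to do that with a loose format check. The function name `parse_ror_ids` is hypothetical, and the check only confirms the `https://ror.org/` prefix rather than validating the identifier itself.

```python
ROR_PREFIX = "https://ror.org/"


def parse_ror_ids(raw):
    """Split a comma-separated ROR_ID setting into a list of full ROR URLs.

    Hypothetical helper for illustration; the real script may parse
    and validate the value differently.
    """
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    # Loose check: require the full-URL form and a non-empty suffix.
    invalid = [i for i in ids
               if not i.startswith(ROR_PREFIX) or len(i) == len(ROR_PREFIX)]
    if invalid:
        raise ValueError(f"ROR IDs must be full URLs like {ROR_PREFIX}...: {invalid}")
    if not ids:
        raise ValueError("ROR_ID is empty")
    return ids
```

For example, `parse_ror_ids(os.environ["ROR_ID"])` (after `import os`) would return a one-element list for a single institution and a two-element list for the second form above.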
The PURE_PUBLISHED_AFTER parameter filters Pure publications. Use ISO 8601 format:
PURE_PUBLISHED_AFTER=2023-12-31T00:00:00.000Z
FROM_YEAR and TO_YEAR control the OpenAlex query:
FROM_YEAR=2024
TO_YEAR=2024
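To show how the ROR IDs and the year range could combine into a single OpenAlex request, here is a hedged sketch that only builds the URL. It assumes the OpenAlex works endpoint with the `authorships.institutions.ror`, `from_publication_date` and `to_publication_date` filters and cursor paging. The script itself may build its query differently (for instance with a `publication_year` filter), and `build_openalex_url` is a hypothetical name.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(ror_ids, from_year, to_year, per_page=200, cursor="*"):
    """Build a works query for any of ror_ids within an inclusive year range.

    Illustrative only: the filter choice is an assumption about how
    FROM_YEAR and TO_YEAR might be applied.
    """
    if from_year > to_year:
        raise ValueError("FROM_YEAR must not be later than TO_YEAR")
    filters = [
        # "|" means OR within one filter; "," between filters means AND.
        "authorships.institutions.ror:" + "|".join(ror_ids),
        f"from_publication_date:{from_year}-01-01",
        f"to_publication_date:{to_year}-12-31",
    ]
    params = {"filter": ",".join(filters), "per-page": per_page, "cursor": cursor}
    return OPENALEX_WORKS_URL + "?" + urlencode(params)
```

Setting FROM_YEAR and TO_YEAR to the same value, as in the example above, yields a one-year window from 1 January to 31 December.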
The script generates an Excel file with the following columns:
| Column | Description |
|---|---|
| DOI | Digital Object Identifier |
| Title | Publication title |
| Authors (My Institution) | Authors affiliated with your institution(s) |
| Affiliations (My Institution) | Raw affiliation strings for your authors |
| ORCID (My Institution) | ORCID identifiers for your authors |
| Publication Year | Year of publication |
| Publication Date | Full publication date |
| Is OA | Whether the publication is Open Access |
| OA Status | Open access status (gold, green, hybrid, etc.) |
| OA URL | URL to open access version |
| Accepted | Whether an accepted manuscript version exists |
| Published | Whether a published version exists |
| License | Publication license |
| PDF URL | Direct link to PDF (if available) |
| Type | Publication type |
| Source | Journal/conference name |
| Link | Clickable DOI link |
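The three "(My Institution)" columns depend on picking out only those authors whose affiliation carries one of your ROR IDs. A rough sketch of that filtering follows. It assumes the OpenAlex work record exposes `authorships`, each with `author`, `institutions` and `raw_affiliation_strings`. The function name and the sample record are made up for illustration.

```python
def institution_authors(work, ror_ids):
    """Collect names, raw affiliations and ORCIDs for authors tied to ror_ids.

    Hypothetical helper; field names follow the OpenAlex work schema
    as assumed in the paragraph above.
    """
    wanted = {ror.lower() for ror in ror_ids}
    names, affiliations, orcids = [], [], []
    for authorship in work.get("authorships") or []:
        institutions = authorship.get("institutions") or []
        rors = {(inst.get("ror") or "").lower() for inst in institutions}
        if not rors & wanted:
            continue  # author has no affiliation with a configured ROR ID
        author = authorship.get("author") or {}
        names.append(author.get("display_name") or "")
        if author.get("orcid"):
            orcids.append(author["orcid"])
        affiliations.extend(authorship.get("raw_affiliation_strings") or [])
    return {"authors": names, "affiliations": affiliations, "orcids": orcids}


# Made-up record containing only the fields used above.
sample_work = {
    "authorships": [
        {
            "author": {"display_name": "Ada Example",
                       "orcid": "https://orcid.org/0000-0000-0000-0000"},
            "institutions": [{"ror": "https://ror.org/04m5j1k67"}],
            "raw_affiliation_strings": ["Dept. of Examples, Example University"],
        },
        {
            "author": {"display_name": "Bob Elsewhere", "orcid": None},
            "institutions": [{"ror": "https://ror.org/02jk5qe80"}],
            "raw_affiliation_strings": ["Another Institute"],
        },
    ]
}
```

Calling `institution_authors(sample_work, ["https://ror.org/04m5j1k67"])` keeps only the first author. Passing both ROR IDs keeps both.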
The script saves progress every 10 pages. If interrupted, simply run it again and choose "yes" when asked to resume.
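The save-and-resume behaviour can be pictured with a small checkpoint sketch. This is an assumption about the general approach, not the script's real implementation: the JSON layout, the function names, and the `fetch_page` callback are all hypothetical.

```python
import json
import os
import tempfile


def save_progress(path, next_page, dois):
    """Write the checkpoint atomically so an interruption cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"next_page": next_page, "dois": sorted(dois)}, handle)
    os.replace(tmp_path, path)


def load_progress(path):
    """Return (next_page, dois); start from page 1 if no usable checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as handle:
            state = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return 1, set()
    return state["next_page"], set(state["dois"])


def fetch_all(fetch_page, path, save_every=10):
    """Fetch pages until one comes back empty, checkpointing every save_every pages."""
    page, dois = load_progress(path)
    while True:
        batch = fetch_page(page)  # hypothetical callback returning a list of DOIs
        if not batch:
            break
        dois.update(batch)
        page += 1
        if (page - 1) % save_every == 0:
            save_progress(path, page, dois)
    if os.path.exists(path):
        os.remove(path)  # finished cleanly, so the checkpoint is no longer needed
    return dois
```

With this layout, a run interrupted on page 25 resumes from page 21, because the last checkpoint was written after page 20.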
Make sure your ROR IDs in .env are correct and match the format used by OpenAlex (full URLs like https://ror.org/...).
If you encounter Pure API errors:
- Check your API key is valid
- Verify the API URL is correct
- As a Pure admin, try running the ContentCorrectionJob in Pure (Admin → System → Jobs) to fix data integrity issues
Check that:
- Your date filters are correct
- Publications exist in both systems for the specified criteria
- Your ROR ID matches publications in OpenAlex
All runs are logged to timestamped files in the logs/ directory:
logs/pubfinder_YYYYMMDD_HHMMSS.log
Check these logs for detailed information about each run.
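For reference, a log file following the naming pattern above could be set up with Python's standard `logging` module roughly as follows. This is a sketch under assumptions: the logger name, message format, and the `setup_logging` function are illustrative and may not match the script.

```python
import logging
import os
from datetime import datetime


def setup_logging(log_dir="logs", now=None):
    """Create logs/pubfinder_YYYYMMDD_HHMMSS.log and return (logger, path).

    Hypothetical helper; the format string is an assumption.
    """
    os.makedirs(log_dir, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    log_path = os.path.join(log_dir, f"pubfinder_{stamp}.log")

    logger = logging.getLogger("pubfinder")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    # Prefix every message with a timestamp and level.
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, log_path
```

Because the timestamp is part of the file name, each run writes to its own file and earlier logs are not overwritten.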
- Pure Fetch: Retrieves all research outputs from Pure matching the date filter
- DOI Extraction: Extracts and normalizes DOIs from Pure publications
- OpenAlex Query: Fetches publications from OpenAlex filtered by:
  - Institution ROR ID(s)
  - Publication year range
- Author Filtering: Identifies which authors on each OpenAlex publication are from your institution(s)
- Comparison: Matches DOIs between systems
- Report Generation: Creates Excel file with publications found in OpenAlex but missing in Pure
- The script uses DOI matching, so publications without DOIs won't be matched
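- The script may report false positives when a publication's year differs between OpenAlex and Pure (for example, a work dated 2024 in OpenAlex but 2023 in Pure falls outside a 2024-only comparison). Querying a range of years rather than a single year is therefore recommended.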
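Since the whole comparison rests on DOI matching, the DOI normalization and comparison steps can be sketched as below. The exact normalization rules the script applies are not documented here, so the prefix list and both function names are assumptions. The sketch relies on OpenAlex returning the `doi` field as a full `https://doi.org/...` URL.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Return a bare lowercase DOI such as '10.1234/abc', or None if unusable."""
    if not value:
        return None
    doi = value.strip().lower()  # DOIs are case-insensitive
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi if doi.startswith("10.") else None


def find_missing(openalex_works, pure_dois):
    """Return OpenAlex works whose DOI does not appear among the Pure DOIs."""
    known = {d for d in map(normalize_doi, pure_dois) if d}
    missing = []
    for work in openalex_works:
        doi = normalize_doi(work.get("doi"))
        # Works without a DOI cannot be compared, so they are skipped.
        if doi and doi not in known:
            missing.append(work)
    return missing
```

Normalizing both sides before comparing avoids spurious mismatches caused by differing case or by one system storing the URL form and the other the bare identifier.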
For issues or questions:
- Check the logs in the logs/ directory
- Review the troubleshooting section above
- Open an issue on GitHub
See LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Uses the OpenAlex API for publication data
- Works with Elsevier Pure research information systems
- ROR IDs from Research Organization Registry