aion-labs/Pubmed-Networks

PubMed Co-citation Network Analysis

Build co-citation networks from PubMed literature with support for genes, cell types, and drugs. Analyze disease-entity relationships based on MeSH terms, with date-range filtering and organism-specific analysis, including ortholog mapping.

Versions

This repository contains two versions of the pipeline:

Version 2 (v2/) - Multi-Entity Network Analysis ⭐ RECOMMENDED

  • Nodes: Genes, Cell Types (Cell Ontology), and Drugs (DrugBank)
  • Edges: Co-citation relationships between any entity types
  • Features: MeSH-to-Cell Ontology mapping, drug target genes, enhanced HTML visualizations
  • Use case: Comprehensive disease analysis including cellular and therapeutic context
  • Documentation: v2/README.md

Version 1 (v1/) - Gene-Only Network Analysis (Legacy)

  • Nodes: Genes only
  • Edges: Gene co-citation relationships
  • Features: MeSH-based or gene-centered networks, human-centric filtering
  • Use case: Gene-focused interaction analysis
  • Documentation: See below for v1 documentation

Quick Start

For Multi-Entity Analysis (v2 - Recommended)

See v2/README.md for complete documentation.

# Build databases (one-time setup)
cd v2
bash build_databases.sh

# Run multi-entity analysis
bash run_multi_entity_analysis.sh "Psoriasis" 2014 2024

For Gene-Only Analysis (v1 - Legacy)

Continue reading below for v1 documentation.


Version 1 (v1/) - Gene Co-citation Network Analysis

Overview

This pipeline creates networks where:

  • Nodes are genes
  • Edges connect genes that appear in the same publications (co-citations)
  • Edge weights represent the number of shared publications
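
The node/edge definition above can be sketched in a few lines of Python. This is only an illustration of co-citation counting, not the pipeline's own code; `cocitation_edges` and the toy `papers` mapping are hypothetical names introduced here.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(pmid_to_genes):
    """Count, for every gene pair, how many publications mention both genes.

    pmid_to_genes: dict mapping a PMID to the set of gene symbols linked to it
    (a hypothetical in-memory structure used only for this sketch).
    """
    weights = Counter()
    for genes in pmid_to_genes.values():
        # Sort so each undirected pair has one canonical key, e.g. ("IL17A", "TNF").
        for a, b in combinations(sorted(set(genes)), 2):
            weights[(a, b)] += 1
    return weights


# Toy data: three made-up papers and the genes each one mentions.
papers = {
    101: {"IL17A", "TNF"},
    102: {"IL17A", "TNF", "IL23A"},
    103: {"TP53"},
}
edges = cocitation_edges(papers)
print(edges[("IL17A", "TNF")])  # edge weight = number of shared publications
```

A paper that mentions only one gene (like the third toy paper) contributes a node but no edge.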

Features

  • Human-centric filtering: Focuses on human genes and genes with human orthologs
  • Extract MeSH term to PubMed mappings from local PubMed XML files
  • Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
  • Build networks from MeSH terms OR starting genes
  • Filter by publication year range
  • Filter by organism (human-only or map orthologs to human genes)
  • Apply minimum thresholds for papers per gene and papers per edge
  • Export to DOT format for visualization with Graphviz
  • Generate interactive HTML pages with links to PubMed

Repository Structure

pubmed-gene-network/
├── README.md                    # This file
├── requirements.txt             # Python dependencies (v1)
├── LICENSE                      # License information
├── v2/                          # Version 2: Multi-entity analysis ⭐
│   ├── README.md                # v2 documentation
│   ├── WORKFLOW.md              # Pipeline workflow guide
│   ├── USAGE_GUIDE.md           # Quick usage guide
│   ├── MULTI_ENTITY_BUILD_GUIDE.md  # Database build guide
│   ├── build_databases.sh       # One-time database setup
│   ├── run_multi_entity_analysis.sh  # Run analysis
│   ├── 0_build_mesh_cell_mapping.py  # MeSH to Cell Ontology
│   ├── 1_download_gene_data.py       # Download NCBI data
│   ├── 2_parse_gene_data.py          # Parse gene data
│   ├── 3_extract_mesh_pubmed_database.py  # Extract MeSH-PMID
│   ├── 4_build_cell_database.py      # Build cell type database
│   ├── 5_build_drug_database.py      # Build drug database
│   ├── 6_query_multi_entity_network.py  # Query network
│   ├── 7_export_dot.py               # Export to DOT format
│   ├── 8_generate_html.py            # Generate HTML visualization
│   └── utils.py                 # Shared utilities
├── v1/                          # Version 1: Gene-only analysis (legacy)
│   ├── utils.py                 # Shared utility functions
│   ├── run_analysis.py          # Master script (recommended)
│   ├── 0_build_gene_pmid_filter.py    # Build human-centric filters
│   ├── 1_extract_mesh_pubmed.py       # Extract MeSH-PMID mappings
│   ├── 2_download_gene_data.py        # Download NCBI gene data
│   ├── 3_parse_gene_data.py           # Parse gene data to database
│   ├── 4_build_network.py             # Build MeSH-based network
│   ├── 5_export_dot.py                # Export network to DOT format
│   ├── 6_generate_html.py             # Generate HTML visualization
│   ├── build_gene_interaction_network.py  # Build gene-based network
│   └── query_notochord.py             # Query genes for specific terms
├── docs/                        # v1 Documentation
│   ├── README.md                # Original detailed documentation
│   ├── USAGE_GUIDE.md           # Quick usage guide
│   ├── README_GENE_NETWORK.md   # Gene network documentation
│   └── FILTERING_STRATEGY.md    # Database filtering strategy
└── examples/                    # v1 Example scripts
    └── example_usage.sh         # Example commands

Quick Start

Installation

  1. Clone or download this repository
  2. Install dependencies:

pip install -r requirements.txt

  3. Download PubMed baseline XML files:

    • Visit: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
    • Download all XML files (.xml.gz format)
    • Place them in /data/pubmed-baseline/ or your preferred location
    • Note: This is a large download (~1000+ files, several hundred GB)
    • You can also use the updatefiles if you want more recent data
  4. Create required directories:

mkdir -p /data /results /scratch

  5. Update the PubMed XML path in v1/utils.py if you placed files in a different location

First Time Setup

Run the master script with --setup to create the databases:

python v1/run_analysis.py --setup

This will:

  • Build gene and PMID filters (Step 0)
  • Extract MeSH-PubMed mappings from XML files (Step 1, ~2-3 hours)
  • Download NCBI gene data files (Step 2)
  • Parse gene data into database (Step 3)
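
To give a sense of what Step 1 involves, here is a minimal sketch of streaming MeSH descriptors out of PubMed XML with the standard library. It is not the code in 1_extract_mesh_pubmed.py; the function name `iter_mesh_records` and the sample record are assumptions made for illustration. The element paths follow the public PubMed XML layout (PubmedArticle, MedlineCitation, MeshHeadingList).

```python
import io
import xml.etree.ElementTree as ET


def iter_mesh_records(xml_file):
    """Yield (pmid, year, [MeSH descriptor names]) for each article in a PubMed XML stream."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        # Some records carry <MedlineDate> instead of <Year>; year is None for those here.
        year = elem.findtext("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        terms = [
            d.text
            for d in elem.iterfind("MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName")
        ]
        yield pmid, year, terms
        elem.clear()  # release the article's children before parsing the next one


# Made-up sample record; real baseline files would be opened with gzip.open(path, "rb").
sample = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article><Journal><JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue></Journal></Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""
records = list(iter_mesh_records(io.BytesIO(sample)))
print(records)
```

Streaming with iterparse avoids loading a whole baseline file into memory, which matters given the size of the download.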

You can check setup status at any time:

python v1/run_analysis.py --check-setup

Generate a Network

Once setup is complete, you can generate networks in two ways:

Option 1: MeSH-based Network (Topic-focused)

python v1/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

Option 2: Gene-based Network (Gene-centered)

python v1/build_gene_interaction_network.py \
    --gene IL17A \
    --year-start 2014 \
    --year-end 2024 \
    --organism map_to_human \
    --min-papers-per-gene 5 \
    --min-papers-per-edge 3

Visualize Results

Both network types produce compatible JSON output. Generate visualizations:

# Generate interactive HTML
python v1/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

# Generate DOT file for Graphviz
python v1/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
dot -Tsvg /results/Chordoma_2014_2024_human_only/network.dot -o network.svg
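
For readers unfamiliar with DOT, the following sketch shows roughly what a weighted, undirected co-citation graph looks like in that format. It is a hand-written illustration, not the output of 5_export_dot.py; `to_dot` and the attribute choices (label, penwidth) are assumptions.

```python
def to_dot(edge_weights, graph_name="cocitation"):
    """Render {(gene_a, gene_b): weight} as an undirected Graphviz DOT graph."""
    lines = [f"graph {graph_name} {{"]
    for (a, b), w in sorted(edge_weights.items()):
        # label shows the shared-paper count; penwidth draws heavier edges thicker.
        lines.append(f'  "{a}" -- "{b}" [label="{w}", penwidth={w}];')
    lines.append("}")
    return "\n".join(lines)


# Toy edge list with made-up weights.
dot_text = to_dot({("IL17A", "TNF"): 2, ("IL17A", "IL23A"): 1})
print(dot_text)
```

The printed text can be saved to a .dot file and rendered with the dot command shown above. The sketch does not escape quotes inside node names.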

Usage Modes

MeSH-based Networks

Starting from a disease or topic:

  • Input: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
  • Output: Network of genes co-cited in papers tagged with that MeSH term
  • Use case: Topic-focused research, disease-gene associations
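
As a rough illustration of the first step of a MeSH-based query, the sketch below selects the PMIDs tagged with a term inside a year range. The record layout `(pmid, year, mesh_terms)`, the function name, and the inclusive year bounds are assumptions for this example, not the documented behavior of the pipeline.

```python
def pmids_for_mesh(records, mesh_term, year_start, year_end):
    """Return the set of PMIDs tagged with mesh_term and published in [year_start, year_end].

    records: iterable of (pmid, year, mesh_terms); year may be None when unknown.
    """
    selected = set()
    for pmid, year, terms in records:
        if year is None:
            continue  # this sketch skips records without a usable year
        if year_start <= int(year) <= year_end and mesh_term in terms:
            selected.add(pmid)
    return selected


# Made-up records for illustration.
toy_records = [
    ("1", "2015", ["Chordoma", "Humans"]),
    ("2", "2010", ["Chordoma"]),
    ("3", "2020", ["Psoriasis"]),
    ("4", None, ["Chordoma"]),
]
chordoma_pmids = pmids_for_mesh(toy_records, "Chordoma", 2014, 2024)
print(chordoma_pmids)
```

The genes linked to the selected PMIDs would then become the candidate nodes of the network.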

Gene-based Networks

Starting from a single gene:

  • Input: Gene symbol (e.g., "IL17A", "TNF", "TP53")
  • Output: Network of genes co-cited with the starting gene
  • Use case: Gene-centered analysis, finding interaction partners

Key Parameters

Common Parameters

  • --year-start YEAR - Start year for publication filter
  • --year-end YEAR - End year for publication filter
  • --organism MODE - Organism filtering mode:
    • human_only: Only human genes (tax_id=9606)
    • map_to_human: Include all species, map to human orthologs (default)
  • --min-papers-per-gene N - Minimum publications per gene (default varies by script, in the range 2-5)
  • --min-papers-per-edge N - Minimum shared publications for edge (default varies by script, in the range 2-3)
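
The two thresholds can be pictured with a small sketch. It assumes that genes are filtered first and edges second, which is a plausible reading of the parameters rather than confirmed pipeline behavior; `apply_thresholds` and the toy data are hypothetical.

```python
from itertools import combinations


def apply_thresholds(gene_to_pmids, min_papers_per_gene=2, min_papers_per_edge=2):
    """Keep genes with enough papers, then keep edges with enough shared papers.

    gene_to_pmids: dict mapping a gene symbol to the set of PMIDs that mention it.
    """
    nodes = {g: p for g, p in gene_to_pmids.items() if len(p) >= min_papers_per_gene}
    edges = {}
    for a, b in combinations(sorted(nodes), 2):
        shared = nodes[a] & nodes[b]  # publications mentioning both genes
        if len(shared) >= min_papers_per_edge:
            edges[(a, b)] = len(shared)
    return nodes, edges


# Made-up gene-to-PMID sets.
toy = {"TNF": {1, 2, 3}, "IL6": {1, 2, 4}, "IL1B": {2, 4}}
nodes, edges = apply_thresholds(toy, min_papers_per_gene=3, min_papers_per_edge=2)
print(sorted(nodes), edges)
```

Raising either threshold shrinks the network: here the third gene is dropped by the per-gene threshold before any of its edges are considered.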

MeSH-specific

  • --mesh TERM - MeSH term to analyze (required)

Gene-specific

  • --gene SYMBOL - Starting gene symbol (required)
  • --exclude-seed - Exclude starting gene from network (optional)
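
A gene-centered query can be thought of as counting partners across the papers that mention the seed gene. The sketch below illustrates that idea and one possible meaning of --exclude-seed; `seed_partners` and its data layout are assumptions, not the logic of build_gene_interaction_network.py.

```python
def seed_partners(pmid_to_genes, seed, exclude_seed=False):
    """Return {gene: number of papers it shares with the seed gene}."""
    counts = {}
    for genes in pmid_to_genes.values():
        if seed not in genes:
            continue  # only papers that mention the seed gene contribute
        for gene in genes:
            if exclude_seed and gene == seed:
                continue
            counts[gene] = counts.get(gene, 0) + 1
    return counts


# Made-up papers for illustration.
toy_papers = {
    1: {"IL17A", "TNF"},
    2: {"IL17A", "IL23A", "TNF"},
    3: {"TNF", "TP53"},
}
with_seed = seed_partners(toy_papers, "IL17A")
without_seed = seed_partners(toy_papers, "IL17A", exclude_seed=True)
print(with_seed, without_seed)
```

In this reading, the third toy paper is ignored because it does not mention the seed, even though it mentions one of the seed's partners.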

Database Filtering Strategy

To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:

  1. Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
  2. PMID filtering: Only extracts PubMed articles that reference filtered genes (~1.27M PMIDs)
  3. Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature

See docs/FILTERING_STRATEGY.md for detailed information.
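
The two filtering steps above can be sketched as set operations. This is an illustration under stated assumptions, not the code in 0_build_gene_pmid_filter.py: the row layouts are simplified tuples, the IDs are made up, and `human_centric_filter` is a hypothetical name.

```python
HUMAN_TAX_ID = "9606"


def human_centric_filter(gene_rows, ortholog_rows, gene2pubmed_rows):
    """Return (kept gene IDs, kept PMIDs).

    gene_rows:        (tax_id, gene_id)
    ortholog_rows:    (tax_id, gene_id, other_tax_id, other_gene_id)
    gene2pubmed_rows: (tax_id, gene_id, pmid)
    """
    # Step 1: human genes plus any gene paired with a human gene as an ortholog.
    keep = {gene_id for tax_id, gene_id in gene_rows if tax_id == HUMAN_TAX_ID}
    for tax_id, gene_id, other_tax_id, other_gene_id in ortholog_rows:
        if tax_id == HUMAN_TAX_ID:
            keep.add(other_gene_id)
        elif other_tax_id == HUMAN_TAX_ID:
            keep.add(gene_id)
    # Step 2: only PMIDs that reference a kept gene.
    pmids = {pmid for _tax_id, gene_id, pmid in gene2pubmed_rows if gene_id in keep}
    return keep, pmids


# Toy rows: one human gene, its mouse ortholog, and an unrelated zebrafish gene.
genes = [("9606", "h1"), ("10090", "m1"), ("7955", "z9")]
orthologs = [("9606", "h1", "10090", "m1")]
links = [("9606", "h1", "p1"), ("10090", "m1", "p2"), ("7955", "z9", "p3")]
kept_genes, kept_pmids = human_centric_filter(genes, orthologs, links)
print(sorted(kept_genes), sorted(kept_pmids))
```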

Output Files

Networks are saved to /results/{network_name}/:

  • network_data.json - Complete network data with metadata
  • nodes.csv - List of genes with paper counts
  • edges.csv - Network edges with shared paper counts
  • index.html - Interactive visualization (after running 6_generate_html.py)
  • network.dot - Graphviz file (after running 5_export_dot.py)
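
Because this README does not spell out the column names or JSON schema, a cautious way to start working with the outputs is to inspect them first. The helper below is a hypothetical sketch that reports the top-level JSON keys and the CSV headers without assuming any particular layout.

```python
import csv
import json
from pathlib import Path


def summarize_outputs(result_dir):
    """Report top-level JSON keys plus column names and row counts of the CSV files."""
    result_dir = Path(result_dir)
    with open(result_dir / "network_data.json", encoding="utf-8") as fh:
        data = json.load(fh)
    summary = {
        "network_data.json": sorted(data) if isinstance(data, dict) else type(data).__name__
    }
    for name in ("nodes.csv", "edges.csv"):
        with open(result_dir / name, newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            rows = list(reader)
            columns = reader.fieldnames  # header row, whatever the pipeline wrote
        summary[name] = {"columns": columns, "rows": len(rows)}
    return summary


# Example call (the directory name follows the pattern used elsewhere in this README):
# print(summarize_outputs("/results/Chordoma_2014_2024_human_only"))
```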

Documentation

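Detailed documentation is available in the docs/ directory:

  • docs/README.md - Original detailed documentation
  • docs/USAGE_GUIDE.md - Quick usage guide
  • docs/README_GENE_NETWORK.md - Gene network documentation
  • docs/FILTERING_STRATEGY.md - Database filtering strategy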

Examples

See examples/example_usage.sh for various usage examples.

Example 1: Chordoma Network (Last 10 Years)

python v1/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

Example 2: IL17A Interaction Network

python v1/build_gene_interaction_network.py \
    --gene IL17A \
    --year-start 2014 \
    --year-end 2024 \
    --organism map_to_human \
    --min-papers-per-gene 5 \
    --min-papers-per-edge 3

Example 3: TNF Network (Recent, Human Only)

python v1/build_gene_interaction_network.py \
    --gene TNF \
    --year-start 2020 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 10 \
    --min-papers-per-edge 5

Requirements

  • Python 3.6+
  • Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
  • Optional: tqdm (for progress bars during setup)

See requirements.txt for complete list.

Data Requirements

Required Input Data

  1. PubMed XML files: PubMed baseline/updatefiles snapshot

  2. NCBI Gene data files: Downloaded automatically by setup script

    • gene2pubmed.gz - Gene to PubMed mappings
    • gene_info.gz - Gene symbols and names
    • gene_orthologs.gz - Ortholog relationships
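
For orientation, gene2pubmed.gz is a tab-separated file whose header names three columns: tax_id, GeneID, and PubMed_ID. The sketch below reads it into a gene-to-PMID mapping. It is a simplified illustration, not the parser in 3_parse_gene_data.py, and `load_gene2pubmed` is a hypothetical name.

```python
import gzip
from collections import defaultdict


def load_gene2pubmed(path, tax_ids=None):
    """Return {gene_id: set of PMIDs}, optionally restricted to the given taxonomy IDs."""
    gene_to_pmids = defaultdict(set)
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # header line
            tax_id, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
            if tax_ids is None or tax_id in tax_ids:
                gene_to_pmids[gene_id].add(pmid)
    return gene_to_pmids


# Example call (path is an assumption; adjust to wherever the setup step stores the file):
# human = load_gene2pubmed("/data/gene2pubmed.gz", tax_ids={"9606"})
```

Holding the full mapping in memory may be heavy for the complete file, which is presumably one reason the pipeline stores the parsed data in SQLite instead.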

Created Database Files

Setup creates these files in /data/:

  • human_centric_genes.txt - Filtered gene IDs (92.5 MB)
  • human_centric_pmids.txt - Filtered PubMed IDs (10.8 MB)
  • mesh_pubmed.db - SQLite: MeSH→PMID, years, titles
  • gene_pubmed.db - SQLite: Gene→PMID, orthologs
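
The table layouts of these SQLite files are not documented in this README, so the sketch below shows one way to discover them rather than assuming a schema. `list_tables` is a hypothetical helper; it opens the database read-only so that a mistyped path does not silently create an empty file.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: CREATE TABLE statement} for an existing SQLite file."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"  # read-only connection
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return dict(rows)


# Example call (path taken from the list above):
# for name, ddl in list_tables("/data/gene_pubmed.db").items():
#     print(name, ddl, sep="\n", end="\n\n")
```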

Performance

  • Setup time: ~2-3 hours (one-time, with filtering)
  • Network generation: Minutes to hours depending on parameters
  • Database size: Significantly reduced with human-centric filtering (~97% reduction)

Tips for Best Results

  1. Start broad, then narrow: Begin with default parameters, then increase thresholds to reduce network size
  2. Use year filters: Recent papers may show emerging interactions
  3. Compare organism modes: map_to_human provides more coverage, human_only is more specific
  4. Adjust thresholds: For highly studied genes/topics, increase thresholds to focus on strongest interactions
  5. Visualize iteratively: Generate HTML first for quick viewing, then create publication-quality DOT graphics

Common Taxonomy IDs

  • Human: 9606
  • Mouse: 10090
  • Rat: 10116
  • Zebrafish: 7955
  • Fly (D. melanogaster): 7227
  • Worm (C. elegans): 6239
  • Yeast (S. cerevisiae): 4932

Troubleshooting

Missing database files: Ensure setup completed successfully with python v1/run_analysis.py --check-setup

No results: Try relaxing filters (lower thresholds) or check MeSH term/gene symbol spelling

Memory issues: Process smaller date ranges or raise the --min-papers-per-gene and --min-papers-per-edge thresholds

Gene symbol not found: Check spelling and case (usually uppercase). The script will suggest similar symbols if found.

Contributing

Contributions are welcome! Please ensure:

  • Code follows existing style and structure
  • Documentation is updated for new features
  • Test scripts are provided for significant changes

License

[Specify your license here]

Citation

If you use this tool in your research, please cite: [Add citation information if applicable]

Contact

[Add contact information or link to issues page]

Acknowledgments

This tool uses data from:

  • NCBI PubMed
  • NCBI Gene database
  • MeSH (Medical Subject Headings)
