PubMed Gene Co-citation Network Analysis

Build gene co-citation networks from PubMed literature based on MeSH terms, with support for date range filtering and organism-specific analysis including ortholog mapping.

Overview

This pipeline creates networks where:

  • Nodes are genes
  • Edges connect genes that appear in the same publications (co-citations)
  • Edge weights represent the number of shared publications
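The definition above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory mapping from PMID to the genes cited in that paper; the function name `build_cocitation_edges` and the toy data are illustrative and are not taken from the pipeline's code.

```python
from collections import Counter
from itertools import combinations


def build_cocitation_edges(pmid_to_genes):
    """Count, for every unordered gene pair, the papers that cite both genes."""
    edge_weights = Counter()
    for genes in pmid_to_genes.values():
        # Deduplicate and sort so each pair is always stored as (smaller, larger).
        for pair in combinations(sorted(set(genes)), 2):
            edge_weights[pair] += 1
    return edge_weights


# Toy data: the PMIDs and gene assignments are made up for illustration.
papers = {
    101: ["TBXT", "EGFR"],
    102: ["TBXT", "EGFR", "PTEN"],
    103: ["PTEN"],
}
edges = build_cocitation_edges(papers)
for (gene_a, gene_b), weight in sorted(edges.items()):
    print(gene_a, gene_b, weight)
```

A paper that mentions a single gene (PMID 103 here) contributes to that gene's paper count but creates no edge.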

Features

  • Human-centric filtering: Focuses on human genes and genes with human orthologs
  • Extract MeSH term to PubMed mappings from local PubMed XML files
  • Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
  • Filter by publication year range
  • Filter by organism (human-only or map orthologs to human genes)
  • Apply minimum thresholds for papers per gene and papers per edge
  • Export to DOT format for visualization
  • Generate interactive HTML pages with links to PubMed

Database Filtering Strategy

To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:

  1. Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
  2. PMID filtering: Only extracts PubMed articles that reference filtered genes (~1.27M PMIDs)
  3. Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature

See FILTERING_STRATEGY.md for detailed information.
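As a rough illustration of the two filtering steps, the sketch below builds a gene set and a PMID set from rows shaped like the NCBI `gene_orthologs` file (tax_id, GeneID, relationship, Other_tax_id, Other_GeneID) and the `gene2pubmed` file (tax_id, GeneID, PubMed_ID). The function `build_filters` is hypothetical, and taking human genes from `gene2pubmed` rather than `gene_info` is an assumption; the actual logic is described in FILTERING_STRATEGY.md.

```python
def build_filters(ortholog_rows, gene2pubmed_rows, human_tax_id="9606"):
    """Return (gene_ids, pmids) for a human-centric filter.

    ortholog_rows:    (tax_id, gene_id, relationship, other_tax_id, other_gene_id)
    gene2pubmed_rows: (tax_id, gene_id, pmid), given as a list (iterated twice)
    """
    genes = set()
    # Keep both sides of any ortholog pair that involves a human gene.
    for tax_id, gene_id, _relationship, other_tax_id, other_gene_id in ortholog_rows:
        if human_tax_id in (tax_id, other_tax_id):
            genes.update((gene_id, other_gene_id))
    # Keep every human gene that has at least one linked paper (assumption).
    for tax_id, gene_id, _pmid in gene2pubmed_rows:
        if tax_id == human_tax_id:
            genes.add(gene_id)
    # Keep only the PMIDs that reference a retained gene.
    pmids = {pmid for _tax_id, gene_id, pmid in gene2pubmed_rows if gene_id in genes}
    return genes, pmids


# Toy rows with made-up IDs.
ortholog_rows = [
    ("9606", "1", "Ortholog", "10090", "11"),   # human gene 1 <-> mouse gene 11
    ("7955", "50", "Ortholog", "10090", "12"),  # no human side: ignored
]
gene2pubmed_rows = [
    ("9606", "1", "1001"),
    ("9606", "2", "1002"),
    ("10090", "11", "1003"),
    ("10090", "12", "1004"),
]
genes, pmids = build_filters(ortholog_rows, gene2pubmed_rows)
```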

Quick Start

Master Script (Recommended)

Use the all-in-one run_analysis.py script for easy setup and analysis:

# First time: Run setup (creates databases)
python /code/run_analysis.py --setup

# Generate a network
python /code/run_analysis.py --mesh "Chordoma" --year-start 2014 --year-end 2024

# With custom parameters
python /code/run_analysis.py --mesh "Breast Neoplasms" \
    --year-start 2010 --year-end 2023 \
    --organism map_to_human \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

# Check if setup is complete
python /code/run_analysis.py --check-setup

Master Script Options:

  • --setup - Run initial database setup (one-time)
  • --force-setup - Force re-create databases
  • --check-setup - Verify setup status
  • --mesh TERM - MeSH term to analyze
  • --year-start YEAR - Start year
  • --year-end YEAR - End year
  • --organism MODE - human_only or map_to_human
  • --min-papers-per-gene N - Minimum papers per gene (default: 2)
  • --min-papers-per-edge N - Minimum shared papers (default: 2)

Pipeline Steps (Individual Scripts)

0. Build Gene and PMID Filters

Create human-centric filter files for optimized processing.

python /code/0_build_gene_pmid_filter.py

Input: Gene data files from NCBI

Output:

  • /data/human_centric_genes.txt - Filtered gene IDs
  • /data/human_centric_pmids.txt - Filtered PubMed IDs

Note: This is automatically run by run_analysis.py --setup

1. Extract MeSH-PubMed Mappings

Parse PubMed XML files and create a database of MeSH terms, PMIDs, years, and titles. PMID filtering is applied: only articles with gene references are extracted.

python /code/1_extract_mesh_pubmed.py

Input: /data/pubmed-Dec-2023/*.xml.gz

Output: /data/mesh_pubmed.db (SQLite database)

Time: ~2-3 hours for ~1,167 XML files (with filtering)
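The extraction step can be sketched with the standard library alone. The example below streams PubMed XML with `xml.etree.ElementTree.iterparse` and pulls out the PMID, year, title, and MeSH descriptors, optionally skipping PMIDs outside a filter set. The generator `iter_articles` and its `keep_pmids` parameter are hypothetical names, and reading the year only from `PubDate/Year` is a simplification (some records carry a `MedlineDate` instead); the real script may differ.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(xml_source, keep_pmids=None):
    """Yield one dict per PubmedArticle, optionally restricted to keep_pmids."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        citation = elem.find("MedlineCitation")
        pmid = citation.findtext("PMID")
        if keep_pmids is None or pmid in keep_pmids:
            title_elem = citation.find("Article/ArticleTitle")
            yield {
                "pmid": pmid,
                # Simplification: records without PubDate/Year yield None here.
                "year": citation.findtext("Article/Journal/JournalIssue/PubDate/Year"),
                # itertext() also collects text nested in inline markup tags.
                "title": "".join(title_elem.itertext()) if title_elem is not None else None,
                "mesh": [
                    d.text
                    for d in citation.iterfind("MeshHeadingList/MeshHeading/DescriptorName")
                ],
            }
        elem.clear()  # release the parsed article to keep memory use low


# Tiny hand-written record; for real files pass gzip.open(path, "rb") instead.
sample = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID Version="1">1001</PMID>
<Article><Journal><JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue></Journal>
<ArticleTitle>Example title</ArticleTitle></Article>
<MeshHeadingList><MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading></MeshHeadingList>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""
articles = list(iter_articles(io.BytesIO(sample)))
```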

2. Download Gene Data

Download gene-related files from NCBI FTP.

python /code/2_download_gene_data.py

Downloads:

  • gene2pubmed.gz - Gene to PubMed mappings (~237 MB)
  • gene_info.gz - Gene symbols and names (~1.3 GB)
  • gene_orthologs.gz - Ortholog relationships (~107 MB)

Output: Files saved to /data/

Use --force to re-download existing files.

3. Parse Gene Data

Parse the downloaded gene files and populate the database. Gene filtering is applied: only human-centric genes are stored.

python /code/3_parse_gene_data.py

Input: /data/gene*.gz files

Output: /data/gene_pubmed.db (SQLite database)
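A minimal sketch of this step for the `gene2pubmed` file, which is tab-delimited with the columns tax_id, GeneID, PubMed_ID and a header line starting with `#`. The table name `gene_pubmed`, its columns, and the helper `load_gene2pubmed` are assumptions made for illustration; the schema of the real gene_pubmed.db is not documented here.

```python
import gzip
import os
import sqlite3
import tempfile


def load_gene2pubmed(gz_path, conn, keep_genes):
    """Insert gene2pubmed rows whose GeneID is in keep_genes into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_pubmed "
        "(tax_id INTEGER, gene_id INTEGER, pmid INTEGER)"  # hypothetical schema
    )
    with gzip.open(gz_path, "rt") as handle:
        rows = []
        for line in handle:
            if line.startswith("#"):  # skip the header line
                continue
            tax_id, gene_id, pmid = line.rstrip("\n").split("\t")
            if int(gene_id) in keep_genes:
                rows.append((int(tax_id), int(gene_id), int(pmid)))
    conn.executemany("INSERT INTO gene_pubmed VALUES (?, ?, ?)", rows)
    conn.commit()


# Demonstration on a tiny made-up file instead of the real download.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gene2pubmed.gz")
    with gzip.open(path, "wt") as out:
        out.write("#tax_id\tGeneID\tPubMed_ID\n")
        out.write("9606\t1\t1001\n10090\t11\t1003\n10090\t99\t1004\n")
    conn = sqlite3.connect(":memory:")
    load_gene2pubmed(path, conn, keep_genes={1, 11})
    stored = conn.execute(
        "SELECT gene_id, pmid FROM gene_pubmed ORDER BY gene_id"
    ).fetchall()
```

For the full file, inserting in batches rather than collecting every row in one list would keep memory use lower.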

4. Build Network

Construct the gene co-citation network with your parameters.

python /code/4_build_network.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 2 \
    --min-papers-per-edge 2

Parameters:

  • --mesh: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
  • --year-start: Start year for publications
  • --year-end: End year for publications
  • --organism: Organism mode
    • human_only: Only human genes (tax_id=9606)
    • map_to_human: Include all species, map to human orthologs
  • --min-papers-per-gene: Minimum publications per gene (default: 2)
  • --min-papers-per-edge: Minimum shared publications for edge (default: 2)

Output: /results/{mesh}_{year_start}_{year_end}_{organism}/network_data.json
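To show how the year range and the two thresholds interact, here is a small sketch. It assumes the year range is inclusive and that the gene threshold is applied before edges are counted; both are assumptions, as is the function name `build_network` and its in-memory inputs.

```python
from itertools import combinations


def build_network(gene_to_pmids, pmid_year, year_start, year_end,
                  min_papers_per_gene=2, min_papers_per_edge=2):
    """Return (nodes, edges) after applying the year range and both thresholds."""
    # Assumption: both year bounds are inclusive.
    in_range = {p for p, year in pmid_year.items() if year_start <= year <= year_end}

    # Keep genes with enough papers inside the year range.
    nodes = {}
    for gene, pmids in gene_to_pmids.items():
        kept = set(pmids) & in_range
        if len(kept) >= min_papers_per_gene:
            nodes[gene] = kept

    # Keep gene pairs that share enough of those papers.
    edges = {}
    for gene_a, gene_b in combinations(sorted(nodes), 2):
        shared = nodes[gene_a] & nodes[gene_b]
        if len(shared) >= min_papers_per_edge:
            edges[(gene_a, gene_b)] = sorted(shared)
    return nodes, edges


# Toy data with made-up gene labels and PMIDs.
gene_to_pmids = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3}}
pmid_year = {1: 2015, 2: 2016, 3: 2020, 4: 2010}
nodes, edges = build_network(gene_to_pmids, pmid_year, 2014, 2024)
```

In this toy run, gene C is dropped because it has only one paper in range, and PMID 4 is excluded by the year filter.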

5. Export to DOT Format

Generate DOT language file for network visualization.

python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json

Output: network.dot in the same directory
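For orientation, a DOT export of an undirected weighted graph might look like the sketch below. The layout of network_data.json is not described in this README, so the function takes plain Python structures; `to_dot`, the graph name, and the choice of `weight` and `label` attributes are illustrative assumptions.

```python
def to_dot(nodes, edges, name="cocitation"):
    """Render gene nodes and weighted edges as an undirected DOT graph."""
    lines = [f"graph {name} {{"]
    for gene in sorted(nodes):
        lines.append(f'  "{gene}";')
    for (gene_a, gene_b), weight in sorted(edges.items()):
        # "--" is the DOT operator for an undirected edge.
        lines.append(f'  "{gene_a}" -- "{gene_b}" [weight={weight}, label="{weight}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy graph; gene labels containing double quotes would need escaping first.
dot = to_dot(["TBXT", "EGFR"], {("EGFR", "TBXT"): 2})
print(dot)
```

The resulting text can be rendered with Graphviz, for example `dot -Tsvg network.dot -o network.svg`.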

6. Generate HTML Pages

Create interactive HTML visualization with clickable nodes and edges.

python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

Output: Multiple HTML files:

  • index.html - Main page with network overview
  • gene_{id}.html - Papers for each gene
  • edge_{id1}_{id2}.html - Shared papers for each edge

All PubMed IDs link to https://pubmed.ncbi.nlm.nih.gov/
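A per-edge page with PubMed links can be sketched as follows. The function `render_edge_page` and the page layout are hypothetical; only the PubMed base URL and the idea of one page per edge come from the description above.

```python
from html import escape

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def render_edge_page(gene_a, gene_b, papers):
    """Build a minimal HTML page listing the papers shared by two genes.

    papers: list of (pmid, title) tuples.
    """
    heading = escape(f"{gene_a} - {gene_b}: shared papers")
    items = "\n".join(
        # Titles are escaped because they may contain characters such as < or &.
        f'<li><a href="{PUBMED_URL}{pmid}/">{pmid}</a> {escape(title)}</li>'
        for pmid, title in papers
    )
    return (
        f"<html><head><title>{heading}</title></head>"
        f"<body><h1>{heading}</h1><ul>\n{items}\n</ul></body></html>"
    )


# Toy input with a made-up PMID and title.
page = render_edge_page("EGFR", "TBXT", [(1001, "A <b>toy</b> title")])
```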

Example Workflows

Using Master Script (Recommended)

Complete example using the all-in-one script:

# First time: Run setup (creates databases)
python /code/run_analysis.py --setup

# Generate a Chordoma network (2014-2024, human only)
python /code/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser

Using Individual Scripts

If you prefer more control, run individual scripts:

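# One-time setup (steps 0-3)
# Step 0 takes the NCBI gene files as input, so the download runs first (assumed order)
python /code/2_download_gene_data.py
python /code/0_build_gene_pmid_filter.py
python /code/1_extract_mesh_pubmed.py
python /code/3_parse_gene_data.py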

# Build network (steps 4-6)
python /code/4_build_network.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser

Directory Structure

/code/                              # Python scripts
  ├── run_analysis.py               # Master script (all-in-one)
  ├── utils.py                      # Helper functions
  ├── 0_build_gene_pmid_filter.py  # Build filtering sets
  ├── 1_extract_mesh_pubmed.py      # Parse PubMed XML (with filtering)
  ├── 2_download_gene_data.py       # Download NCBI data
  ├── 3_parse_gene_data.py          # Parse gene files (with filtering)
  ├── 4_build_network.py            # Build co-citation network
  ├── 5_export_dot.py               # Export to DOT format
  ├── 6_generate_html.py            # Generate HTML pages
  ├── FILTERING_STRATEGY.md         # Filtering documentation
  └── README.md                     # This file

/data/                              # Input data and databases
  ├── pubmed-Dec-2023/              # PubMed XML files (provided)
  ├── gene2pubmed.gz                # Downloaded from NCBI
  ├── gene_info.gz                  # Downloaded from NCBI
  ├── gene_orthologs.gz             # Downloaded from NCBI
  ├── human_centric_genes.txt       # Filtered gene IDs
  ├── human_centric_pmids.txt       # Filtered PubMed IDs
  ├── mesh_pubmed.db                # SQLite: MeSH→PMID, years, titles (filtered)
  └── gene_pubmed.db                # SQLite: Gene→PMID, orthologs (filtered)

/scratch/                           # Temporary files
  └── (intermediate processing files)

/results/                           # Output networks
  └── {mesh}_{year_start}_{year_end}_{organism}/
      ├── network_data.json         # Network data
      ├── network.dot               # DOT format graph
      ├── index.html                # Main visualization
      ├── gene_{id}.html            # Per-gene pages
      └── edge_{id}_{id}.html       # Per-edge pages

Organism Modes

human_only

Only includes genes from Homo sapiens (tax_id=9606).

map_to_human

Includes genes from all species but maps them to human orthologs:

  • Human genes are used directly
  • Non-human genes are mapped to their human orthologs via the gene_orthologs database
  • If a non-human gene has multiple human orthologs, papers are assigned to all
  • Network nodes are human genes, but may include papers from model organisms
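The mapping rules listed above can be sketched in a few lines. The inputs are hypothetical in-memory dictionaries (gene to PMIDs, gene to taxonomy ID, non-human gene to its human orthologs), and dropping non-human genes that have no human ortholog is an assumption, not something this README states.

```python
from collections import defaultdict


def map_to_human(gene_to_pmids, gene_tax, orthologs, human_tax_id=9606):
    """Reassign every paper to human genes, directly or via orthologs."""
    human_papers = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        if gene_tax.get(gene) == human_tax_id:
            targets = [gene]  # human genes are used directly
        else:
            # Assumption: genes without a human ortholog are dropped.
            targets = orthologs.get(gene, [])
        for human_gene in targets:
            # With several human orthologs, papers are assigned to all of them.
            human_papers[human_gene].update(pmids)
    return dict(human_papers)


# Toy data with made-up gene IDs and PMIDs.
gene_to_pmids = {1: {1001}, 11: {1003}, 12: {1004}, 13: {1005}}
gene_tax = {1: 9606, 11: 10090, 12: 10090, 13: 10090}
orthologs = {11: [1], 12: [1, 2]}  # mouse gene 12 has two human orthologs
mapped = map_to_human(gene_to_pmids, gene_tax, orthologs)
```

Because one paper can be credited to several human genes in this mode, edge weights may be higher than in human_only mode for the same query.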

Common Taxonomy IDs

  • Human: 9606
  • Mouse: 10090
  • Rat: 10116
  • Zebrafish: 7955
  • Fly (D. melanogaster): 7227
  • Worm (C. elegans): 6239
  • Yeast (S. cerevisiae): 4932

Requirements

  • Python 3.6+
  • Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
  • Optional: tqdm (for progress bars)

Notes

  • Setup (Steps 0-3) only needs to be run once
  • Steps 4-6 can be run multiple times with different parameters
  • PubMed data is from December 2023 snapshot
  • Filtering optimization: Database focuses on human genes and genes with human orthologs
  • Processing time: Setup takes ~2-3 hours (with filtering enabled)
  • Raise --min-papers-per-gene and --min-papers-per-edge to focus on high-confidence associations

Troubleshooting

  • Missing database files: Ensure setup (steps 0-3) completed successfully
  • No results: Try relaxing filters (lower thresholds) or check MeSH term spelling
  • Memory issues: Process smaller date ranges or increase min-papers filters

Quick Reference Card

# First time setup
python /code/run_analysis.py --setup

# Check setup status
python /code/run_analysis.py --check-setup

# Generate network (basic)
python /code/run_analysis.py --mesh "Disease Name" --year-start 2014 --year-end 2024

# Generate network (all options; --organism accepts human_only or map_to_human)
python /code/run_analysis.py \
    --mesh "Disease Name" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 2 \
    --min-papers-per-edge 2

# View help
python /code/run_analysis.py --help
