PubMed Gene Co-citation Network Analysis

Build gene co-citation networks from PubMed literature based on MeSH terms, with support for date range filtering and organism-specific analysis including ortholog mapping.

Overview

This pipeline creates networks where:

Nodes are genes
Edges connect genes that appear in the same publications (co-citations)
Edge weights represent the number of shared publications
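
The definition above can be sketched in a few lines. This is a minimal illustration, assuming a plain dictionary from gene symbol to the set of PMIDs citing it; the function name `cocitation_edges` and the sample data are hypothetical and not taken from the pipeline's code.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(gene_to_pmids):
    """Count shared publications for every pair of genes.

    Returns a Counter mapping (gene_a, gene_b) -> number of shared PMIDs,
    with gene_a < gene_b so each undirected edge appears once.
    """
    # Invert the mapping: PMID -> set of genes cited in that paper.
    pmid_to_genes = {}
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            pmid_to_genes.setdefault(pmid, set()).add(gene)

    # Every pair of genes in the same paper adds 1 to that edge's weight.
    edges = Counter()
    for genes in pmid_to_genes.values():
        for pair in combinations(sorted(genes), 2):
            edges[pair] += 1
    return edges


# Made-up sample data for illustration only.
sample = {
    "TBXT": {101, 102, 103},
    "EGFR": {102, 103},
    "PTEN": {103, 104},
}
edges = cocitation_edges(sample)
print(edges[("EGFR", "TBXT")])  # 2 shared papers (102 and 103)
```

Iterating per paper rather than per gene pair keeps the work proportional to the number of genes cited in each paper.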

Features

Human-centric filtering: Focuses on human genes and genes with human orthologs
Extract MeSH term to PubMed mappings from local PubMed XML files
Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
Filter by publication year range
Filter by organism (human-only or map orthologs to human genes)
Apply minimum thresholds for papers per gene and papers per edge
Export to DOT format for visualization
Generate interactive HTML pages with links to PubMed

Database Filtering Strategy

To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:

Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
PMID filtering: Only extracts PubMed articles that reference filtered genes (~1.27M PMIDs)
Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature

See FILTERING_STRATEGY.md for detailed information.
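
As a rough sketch of the two filters described above (not the actual logic in 0_build_gene_pmid_filter.py), the gene set and PMID set could be derived as follows. The column order matches NCBI's gene_orthologs and gene2pubmed files; the function names and the sample IDs are made up for illustration.

```python
HUMAN_TAX_ID = "9606"


def human_centric_genes(gene_rows, ortholog_rows):
    """Return IDs of human genes plus genes that have a human ortholog.

    gene_rows:     iterable of (tax_id, gene_id)
    ortholog_rows: iterable of (tax_id, gene_id, relationship,
                   other_tax_id, other_gene_id)
    """
    keep = {gene_id for tax_id, gene_id in gene_rows if tax_id == HUMAN_TAX_ID}
    for tax_id, gene_id, _rel, other_tax_id, other_gene_id in ortholog_rows:
        # An ortholog pair may list the human gene on either side.
        if tax_id == HUMAN_TAX_ID:
            keep.add(other_gene_id)
        elif other_tax_id == HUMAN_TAX_ID:
            keep.add(gene_id)
    return keep


def human_centric_pmids(gene2pubmed_rows, keep_genes):
    """Return PMIDs that reference at least one kept gene.

    gene2pubmed_rows: iterable of (tax_id, gene_id, pmid)
    """
    return {pmid for _tax, gene_id, pmid in gene2pubmed_rows if gene_id in keep_genes}


# Synthetic rows; the IDs are placeholders, not real NCBI identifiers.
gene_rows = [("9606", "H1"), ("9606", "H2"), ("10090", "M1"), ("7955", "Z1")]
ortholog_rows = [("9606", "H1", "Ortholog", "10090", "M1")]
gene2pubmed_rows = [("9606", "H1", "100"), ("10090", "M1", "200"), ("7955", "Z1", "300")]

genes = human_centric_genes(gene_rows, ortholog_rows)
pmids = human_centric_pmids(gene2pubmed_rows, genes)
print(sorted(genes), sorted(pmids))
```

Here the zebrafish gene Z1 has no human ortholog, so both it and its paper are dropped.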

Quick Start

Master Script (Recommended)

Use the all-in-one run_analysis.py script for easy setup and analysis:

# First time: Run setup (creates databases)
python /code/run_analysis.py --setup

# Generate a network
python /code/run_analysis.py --mesh "Chordoma" --year-start 2014 --year-end 2024

# With custom parameters
python /code/run_analysis.py --mesh "Breast Neoplasms" \
    --year-start 2010 --year-end 2023 \
    --organism map_to_human \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

# Check if setup is complete
python /code/run_analysis.py --check-setup

Master Script Options:

--setup - Run initial database setup (one-time)
--force-setup - Force re-create databases
--check-setup - Verify setup status
--mesh TERM - MeSH term to analyze
--year-start YEAR - Start year
--year-end YEAR - End year
--organism MODE - human_only or map_to_human
--min-papers-per-gene N - Minimum papers per gene (default: 2)
--min-papers-per-edge N - Minimum shared papers (default: 2)

Pipeline Steps (Individual Scripts)

0. Build Gene and PMID Filters

Create human-centric filter files for optimized processing.

python /code/0_build_gene_pmid_filter.py

Input: Gene data files from NCBI

Output:

/data/human_centric_genes.txt - Filtered gene IDs
/data/human_centric_pmids.txt - Filtered PubMed IDs

Note: This step runs automatically as part of run_analysis.py --setup.

1. Extract MeSH-PubMed Mappings

Parse PubMed XML files and create a database of MeSH terms, PMIDs, years, and titles. PMID filtering is applied here: only articles referenced by the filtered genes are extracted.

python /code/1_extract_mesh_pubmed.py

Input: /data/pubmed-Dec-2023/*.xml.gz
Output: /data/mesh_pubmed.db (SQLite database)
Time: ~2-3 hours for ~1,167 XML files (with filtering)
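
The extraction step can be illustrated with a streaming parse over the standard PubMed XML element layout. This is a sketch under assumptions: the function name `iter_articles`, the `keep_pmids` parameter, and the sample XML are hypothetical, and the real script may read other fields or handle dates differently.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for one PubMed XML file.
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2015</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>A made-up title</ArticleTitle>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>Another made-up title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def iter_articles(source, keep_pmids=None):
    """Yield (pmid, year, title, mesh_terms) for each PubmedArticle.

    source:     a binary file-like object (for real data, gzip.open(path, "rb")).
    keep_pmids: optional set of PMIDs; articles outside it are skipped.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = int(elem.findtext("MedlineCitation/PMID"))
        if keep_pmids is None or pmid in keep_pmids:
            year = elem.findtext(
                "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
            )
            title_el = elem.find("MedlineCitation/Article/ArticleTitle")
            # itertext() also collects text inside inline markup such as <i>.
            title = "".join(title_el.itertext()) if title_el is not None else ""
            mesh = [
                d.text
                for d in elem.iterfind(
                    "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
                )
            ]
            yield pmid, (int(year) if year else None), title, mesh
        elem.clear()  # release the parsed article to keep memory flat


records = list(iter_articles(io.BytesIO(SAMPLE_XML), keep_pmids={1}))
print(records)
```

Some PubMed records carry a free-text MedlineDate instead of a Year element; this sketch leaves the year as None in that case.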

2. Download Gene Data

Download gene-related files from NCBI FTP.

python /code/2_download_gene_data.py

Downloads:

gene2pubmed.gz - Gene to PubMed mappings (~237 MB)
gene_info.gz - Gene symbols and names (~1.3 GB)
gene_orthologs.gz - Ortholog relationships (~107 MB)

Output: Files saved to /data/

Use --force to re-download existing files.
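
The skip-unless-forced behavior might look like the sketch below. It assumes the three files sit directly under the NCBI Gene FTP directory listed in References; `download_all` and its `fetch` parameter are hypothetical names, and the demo uses a stub fetcher so nothing is actually downloaded.

```python
import tempfile
import urllib.request
from pathlib import Path

BASE_URL = "https://ftp.ncbi.nih.gov/gene/DATA/"
FILES = ["gene2pubmed.gz", "gene_info.gz", "gene_orthologs.gz"]


def download_all(dest_dir, force=False, fetch=urllib.request.urlretrieve):
    """Download each file unless it already exists; return the names fetched."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in FILES:
        target = dest / name
        if target.exists() and not force:
            continue  # already present; skipped unless force=True
        fetch(BASE_URL + name, str(target))
        fetched.append(name)
    return fetched


def _stub_fetch(url, filename):
    """Stand-in for a real download: just creates an empty file."""
    Path(filename).write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    first = download_all(tmp, fetch=_stub_fetch)               # fetches all three
    second = download_all(tmp, fetch=_stub_fetch)              # nothing to do
    forced = download_all(tmp, force=True, fetch=_stub_fetch)  # fetches again
print(first, second, forced)
```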

3. Parse Gene Data

Parse the downloaded gene files and populate the database. Gene filtering is applied here: only human-centric genes are stored.

python /code/3_parse_gene_data.py

Input: /data/gene*.gz files
Output: /data/gene_pubmed.db (SQLite database)
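
To illustrate this step, the sketch below loads a gzip-compressed, tab-separated gene2pubmed file into SQLite while applying a gene filter. The table name `gene_pubmed`, its columns, and the function `load_gene2pubmed` are assumptions for illustration; the actual schema of gene_pubmed.db is not documented here.

```python
import gzip
import io
import sqlite3

# In-memory stand-in for gene2pubmed.gz (tab-separated, '#' header line).
RAW = b"#tax_id\tGeneID\tPubMed_ID\n9606\t10\t1001\n9606\t20\t1001\n10090\t30\t1002\n"
GZ_BYTES = gzip.compress(RAW)


def load_gene2pubmed(conn, fileobj, keep_genes=None):
    """Insert (tax_id, gene_id, pmid) rows, optionally restricted to keep_genes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_pubmed "
        "(tax_id INTEGER, gene_id INTEGER, pmid INTEGER)"
    )
    rows = []
    with gzip.open(fileobj, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip the header line
            tax_id, gene_id, pmid = (int(x) for x in line.rstrip("\n").split("\t"))
            if keep_genes is None or gene_id in keep_genes:
                rows.append((tax_id, gene_id, pmid))
    conn.executemany("INSERT INTO gene_pubmed VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n = load_gene2pubmed(conn, io.BytesIO(GZ_BYTES), keep_genes={10, 30})
stored = conn.execute("SELECT gene_id FROM gene_pubmed ORDER BY gene_id").fetchall()
print(n, stored)
```

For the full file, inserting in batches rather than building one list would keep memory use bounded.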

4. Build Network

Construct the gene co-citation network with your parameters.

python /code/4_build_network.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 2 \
    --min-papers-per-edge 2

Parameters:

--mesh: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
--year-start: Start year for publications
--year-end: End year for publications
--organism: Organism mode
- human_only: Only human genes (tax_id=9606)
- map_to_human: Include all species, map to human orthologs
--min-papers-per-gene: Minimum publications per gene (default: 2)
--min-papers-per-edge: Minimum shared publications for edge (default: 2)

Output: /results/{mesh}_{year_start}_{year_end}_{organism}/network_data.json
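
The year range and the two thresholds can be pictured as successive filters. The sketch below assumes an inclusive year range and that the gene threshold is applied before the edge threshold; both are assumptions, and the function names and data are illustrative rather than taken from 4_build_network.py.

```python
def pmids_in_range(pmid_to_year, year_start, year_end):
    """Keep PMIDs whose publication year falls in [year_start, year_end]."""
    return {
        pmid
        for pmid, year in pmid_to_year.items()
        if year is not None and year_start <= year <= year_end
    }


def apply_thresholds(gene_to_pmids, edge_to_pmids,
                     min_papers_per_gene=2, min_papers_per_edge=2):
    """Drop genes with too few papers, then edges with too few shared papers."""
    nodes = {
        gene: pmids
        for gene, pmids in gene_to_pmids.items()
        if len(pmids) >= min_papers_per_gene
    }
    edges = {
        (a, b): pmids
        for (a, b), pmids in edge_to_pmids.items()
        # An edge survives only if both endpoints survived the gene filter.
        if a in nodes and b in nodes and len(pmids) >= min_papers_per_edge
    }
    return nodes, edges


# Made-up data for illustration only.
in_range = pmids_in_range({1: 2013, 2: 2015, 3: 2020, 4: 2024}, 2014, 2024)
genes = {"A": {2, 3, 4}, "B": {2, 3}, "C": {4}}
pairs = {("A", "B"): {2, 3}, ("A", "C"): {4}}
nodes, edges_kept = apply_thresholds(genes, pairs)
print(sorted(in_range), sorted(nodes), list(edges_kept))
```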

5. Export to DOT Format

Generate DOT language file for network visualization.

python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json

Output: network.dot in the same directory
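
A DOT export for an undirected weighted graph can be quite short. The sketch below assumes nodes are given as an ID-to-symbol mapping and edges as a pair-to-weight mapping; the layout of network_data.json is not specified in this README, so these input shapes and the function name `to_dot` are hypothetical.

```python
def to_dot(nodes, edges, graph_name="cocitation"):
    """Render an undirected graph in DOT.

    nodes: {gene_id: symbol}
    edges: {(gene_id_a, gene_id_b): weight}
    """
    lines = ["graph %s {" % graph_name]
    for gene_id, symbol in sorted(nodes.items()):
        label = symbol.replace('"', '\\"')  # escape quotes inside labels
        lines.append('  "%s" [label="%s"];' % (gene_id, label))
    for (a, b), weight in sorted(edges.items()):
        # "--" is the undirected edge operator in DOT.
        lines.append('  "%s" -- "%s" [weight=%d, label="%d"];' % (a, b, weight, weight))
    lines.append("}")
    return "\n".join(lines)


dot = to_dot({"1": "TBXT", "2": "EGFR"}, {("1", "2"): 3})
print(dot)
```

The resulting text can be rendered with Graphviz, for example with the `dot` or `neato` layout programs.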

6. Generate HTML Pages

Create interactive HTML visualization with clickable nodes and edges.

python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

Output: Multiple HTML files:

index.html - Main page with network overview
gene_{id}.html - Papers for each gene
edge_{id1}_{id2}.html - Shared papers for each edge

All PubMed IDs link to https://pubmed.ncbi.nlm.nih.gov/
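
Building the PubMed links is a matter of appending the PMID to the base URL above. The helper names below are hypothetical and the markup is only a sketch of what a per-gene or per-edge page might contain.

```python
from html import escape

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_link(pmid, title=None):
    """Return an HTML anchor pointing at the PubMed page for a PMID."""
    label = escape(title) if title else str(pmid)
    return '<a href="%s%d/">%s</a>' % (PUBMED_BASE, int(pmid), label)


def paper_list(papers):
    """Render (pmid, title) pairs as an HTML unordered list."""
    items = "\n".join("  <li>%s</li>" % pubmed_link(p, t) for p, t in papers)
    return "<ul>\n%s\n</ul>" % items


print(paper_list([(12345, "A made-up title"), (67890, None)]))
```

Titles are escaped because PubMed article titles can contain characters such as `<` and `&`.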

Example Workflows

Using Master Script (Recommended)

Complete example using the all-in-one script:

# First time: Run setup (creates databases)
python /code/run_analysis.py --setup

# Generate a Chordoma network (2014-2024, human only)
python /code/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser

Using Individual Scripts

If you prefer more control, run individual scripts:

# One-time setup (steps 1-3)
python /code/1_extract_mesh_pubmed.py
python /code/2_download_gene_data.py
python /code/3_parse_gene_data.py

# Build network (steps 4-6)
python /code/4_build_network.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser

Directory Structure

/code/                              # Python scripts
  ├── run_analysis.py               # Master script (all-in-one)
  ├── utils.py                      # Helper functions
  ├── 0_build_gene_pmid_filter.py   # Build filtering sets
  ├── 1_extract_mesh_pubmed.py      # Parse PubMed XML (with filtering)
  ├── 2_download_gene_data.py       # Download NCBI data
  ├── 3_parse_gene_data.py          # Parse gene files (with filtering)
  ├── 4_build_network.py            # Build co-citation network
  ├── 5_export_dot.py               # Export to DOT format
  ├── 6_generate_html.py            # Generate HTML pages
  ├── FILTERING_STRATEGY.md         # Filtering documentation
  └── README.md                     # This file

/data/                              # Input data and databases
  ├── pubmed-Dec-2023/              # PubMed XML files (provided)
  ├── gene2pubmed.gz                # Downloaded from NCBI
  ├── gene_info.gz                  # Downloaded from NCBI
  ├── gene_orthologs.gz             # Downloaded from NCBI
  ├── human_centric_genes.txt       # Filtered gene IDs
  ├── human_centric_pmids.txt       # Filtered PubMed IDs
  ├── mesh_pubmed.db                # SQLite: MeSH→PMID, years, titles (filtered)
  └── gene_pubmed.db                # SQLite: Gene→PMID, orthologs (filtered)

/scratch/                           # Temporary files
  └── (intermediate processing files)

/results/                           # Output networks
  └── {mesh}_{year_start}_{year_end}_{organism}/
      ├── network_data.json         # Network data
      ├── network.dot               # DOT format graph
      ├── index.html                # Main visualization
      ├── gene_{id}.html            # Per-gene pages
      └── edge_{id}_{id}.html       # Per-edge pages

Organism Modes

human_only

Only includes genes from Homo sapiens (tax_id=9606).

map_to_human

Includes genes from all species but maps them to human orthologs:

Human genes are used directly
Non-human genes are mapped to their human orthologs via the gene_orthologs database
If a non-human gene has multiple human orthologs, its papers are assigned to all of them
Network nodes are human genes, but may include papers from model organisms
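
The mapping rules above can be sketched as follows. The input shapes and the function name are assumptions for illustration, as is the choice to drop non-human genes that have no human ortholog.

```python
HUMAN = 9606


def map_to_human(gene_to_pmids, gene_tax, orthologs):
    """Re-key papers onto human genes.

    gene_to_pmids: {gene_id: set of PMIDs}
    gene_tax:      {gene_id: tax_id}
    orthologs:     {non_human_gene_id: set of human gene IDs}
    """
    human = {}
    for gene, pmids in gene_to_pmids.items():
        if gene_tax.get(gene) == HUMAN:
            targets = {gene}  # human genes are used directly
        else:
            # Assumption: genes without a human ortholog are dropped.
            targets = orthologs.get(gene, set())
        for target in targets:
            # With several human orthologs, papers go to every one of them.
            human.setdefault(target, set()).update(pmids)
    return human


# Synthetic IDs for illustration only.
mapped = map_to_human(
    gene_to_pmids={"H1": {1}, "M1": {2}, "M2": {3}, "Z1": {4}},
    gene_tax={"H1": 9606, "M1": 10090, "M2": 10090, "Z1": 7955},
    orthologs={"M1": {"H1"}, "M2": {"H1", "H2"}},
)
print(mapped)
```

In this example H1 collects papers 1, 2 and 3, H2 receives paper 3 through the multi-ortholog gene M2, and Z1 contributes nothing.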

Common Taxonomy IDs

Human: 9606
Mouse: 10090
Rat: 10116
Zebrafish: 7955
Fly (D. melanogaster): 7227
Worm (C. elegans): 6239
Yeast (S. cerevisiae): 4932

Requirements

Python 3.6+
Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
Optional: tqdm (for progress bars)

Notes

Setup (Steps 0-3) only needs to be run once
Steps 4-6 can be run multiple times with different parameters
PubMed data comes from the December 2023 snapshot, so publications added to PubMed after that date are not included
Filtering optimization: The databases contain only human genes and genes with human orthologs
Processing time: Setup takes ~2-3 hours (with filtering enabled)
Raise --min-papers-per-gene and --min-papers-per-edge to focus on high-confidence associations

Troubleshooting

Missing database files: Ensure the setup steps (0-3) completed successfully; run_analysis.py --check-setup reports the current status
No results: Try relaxing the filters (lower thresholds) or check the MeSH term spelling
Memory issues: Process smaller date ranges or increase the min-papers filters
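
For a quick manual check of missing files, something like the sketch below works. The file names come from the Directory Structure section; the function itself is hypothetical and is not how --check-setup is necessarily implemented.

```python
import tempfile
from pathlib import Path

# Files that setup is expected to produce, per the Directory Structure section.
EXPECTED = [
    "human_centric_genes.txt",
    "human_centric_pmids.txt",
    "mesh_pubmed.db",
    "gene_pubmed.db",
]


def missing_setup_files(data_dir="/data"):
    """Return the expected setup outputs that are absent from data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED if not (base / name).is_file()]


# Demo against a temporary directory containing only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mesh_pubmed.db").write_bytes(b"")
    missing = missing_setup_files(tmp)
print(missing)
```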

Quick Reference Card

# First time setup
python /code/run_analysis.py --setup

# Check setup status
python /code/run_analysis.py --check-setup

# Generate network (basic)
python /code/run_analysis.py --mesh "Disease Name" --year-start 2014 --year-end 2024

# Generate network (all options; --organism accepts human_only or map_to_human)
python /code/run_analysis.py \
    --mesh "Disease Name" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 2 \
    --min-papers-per-edge 2

# View help
python /code/run_analysis.py --help

References

PubMed DTD: https://wayback.archive-it.org/org-350/20240424204414/https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
NCBI Gene FTP: https://ftp.ncbi.nih.gov/gene/DATA/
DOT Language: https://graphviz.org/doc/info/lang.html
