PubMed Co-citation Network Analysis

Build co-citation networks from PubMed literature with support for genes, cell types, and drugs. Analyze disease-entity relationships based on MeSH terms with date range filtering and organism-specific analysis including ortholog mapping.

Versions

This repository contains two versions of the pipeline:

Version 2 (v2/) - Multi-Entity Network Analysis ⭐ RECOMMENDED

Nodes: Genes, Cell Types (Cell Ontology), and Drugs (DrugBank)
Edges: Co-citation relationships between any entity types
Features: MeSH-to-Cell Ontology mapping, drug target genes, enhanced HTML visualizations
Use case: Comprehensive disease analysis including cellular and therapeutic context
Documentation: v2/README.md

Version 1 (v1/) - Gene-Only Network Analysis (Legacy)

Nodes: Genes only
Edges: Gene co-citation relationships
Features: MeSH-based or gene-centered networks, human-centric filtering
Use case: Gene-focused interaction analysis
Documentation: See below for v1 documentation

Quick Start

For Multi-Entity Analysis (v2 - Recommended)

See v2/README.md for complete documentation.

# Build databases (one-time setup)
cd v2
bash build_databases.sh

# Run multi-entity analysis
bash run_multi_entity_analysis.sh "Psoriasis" 2014 2024

For Gene-Only Analysis (v1 - Legacy)

Continue reading below for v1 documentation.

Version 1 (v1/) - Gene Co-citation Network Analysis

Overview

This pipeline creates networks where:

Nodes are genes
Edges connect genes that appear in the same publications (co-citations)
Edge weights represent the number of shared publications
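
The edge definition above can be illustrated with a few lines of Python. This is a minimal sketch, not the pipeline's own code: the function name `cocitation_edges` and the toy PMID-to-gene mapping are assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(pmid_to_genes):
    """Count, for every unordered gene pair, the papers that mention both genes."""
    weights = Counter()
    for genes in pmid_to_genes.values():
        # sorted(set(...)) gives each pair one canonical order, so (A, B) == (B, A)
        for pair in combinations(sorted(set(genes)), 2):
            weights[pair] += 1
    return weights


# Toy data (invented): PMID -> gene symbols linked to that paper
papers = {
    101: ["TP53", "TNF"],
    102: ["TP53", "TNF", "IL17A"],
    103: ["IL17A"],
}
edges = cocitation_edges(papers)
print(edges[("TNF", "TP53")])  # 2 shared publications
```

A paper that mentions only one gene (PMID 103 here) contributes a node but no edge.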

Features

Human-centric filtering: Focuses on human genes and genes with human orthologs
Extract MeSH term to PubMed mappings from local PubMed XML files
Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
Build networks from MeSH terms OR starting genes
Filter by publication year range
Filter by organism (human-only or map orthologs to human genes)
Apply minimum thresholds for papers per gene and papers per edge
Export to DOT format for visualization with Graphviz
Generate interactive HTML pages with links to PubMed

Repository Structure

pubmed-gene-network/
├── README.md                    # This file
├── requirements.txt             # Python dependencies (v1)
├── LICENSE                      # License information
├── v2/                          # Version 2: Multi-entity analysis ⭐
│   ├── README.md                # v2 documentation
│   ├── WORKFLOW.md              # Pipeline workflow guide
│   ├── USAGE_GUIDE.md           # Quick usage guide
│   ├── MULTI_ENTITY_BUILD_GUIDE.md  # Database build guide
│   ├── build_databases.sh       # One-time database setup
│   ├── run_multi_entity_analysis.sh  # Run analysis
│   ├── 0_build_mesh_cell_mapping.py  # MeSH to Cell Ontology
│   ├── 1_download_gene_data.py       # Download NCBI data
│   ├── 2_parse_gene_data.py          # Parse gene data
│   ├── 3_extract_mesh_pubmed_database.py  # Extract MeSH-PMID
│   ├── 4_build_cell_database.py      # Build cell type database
│   ├── 5_build_drug_database.py      # Build drug database
│   ├── 6_query_multi_entity_network.py  # Query network
│   ├── 7_export_dot.py               # Export to DOT format
│   ├── 8_generate_html.py            # Generate HTML visualization
│   └── utils.py                 # Shared utilities
├── v1/                          # Version 1: Gene-only analysis (legacy)
│   ├── utils.py                 # Shared utility functions
│   ├── run_analysis.py          # Master script (recommended)
│   ├── 0_build_gene_pmid_filter.py    # Build human-centric filters
│   ├── 1_extract_mesh_pubmed.py       # Extract MeSH-PMID mappings
│   ├── 2_download_gene_data.py        # Download NCBI gene data
│   ├── 3_parse_gene_data.py           # Parse gene data to database
│   ├── 4_build_network.py             # Build MeSH-based network
│   ├── 5_export_dot.py                # Export network to DOT format
│   ├── 6_generate_html.py             # Generate HTML visualization
│   ├── build_gene_interaction_network.py  # Build gene-based network
│   └── query_notochord.py             # Query genes for specific terms
├── docs/                        # v1 Documentation
│   ├── README.md                # Original detailed documentation
│   ├── USAGE_GUIDE.md           # Quick usage guide
│   ├── README_GENE_NETWORK.md   # Gene network documentation
│   └── FILTERING_STRATEGY.md    # Database filtering strategy
└── examples/                    # v1 Example scripts
    └── example_usage.sh         # Example commands

Quick Start

Installation

Clone or download this repository
Install dependencies:

pip install -r requirements.txt

Download PubMed baseline XML files:
- Visit: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
- Download all XML files (.xml.gz format)
- Place them in /data/pubmed-baseline/ or your preferred location
- Note: This is a large download (more than 1,000 files, several hundred GB)
- You can also use the updatefiles if you want more recent data
Create required directories:

mkdir -p /data /results /scratch

Update the PubMed XML path in v1/utils.py if you placed files in a different location

First Time Setup

Run the master script with --setup to create the databases:

python v1/run_analysis.py --setup

This will:

Build gene and PMID filters (Step 0)
Extract MeSH-PubMed mappings from XML files (Step 1, ~2-3 hours)
Download NCBI gene data files (Step 2)
Parse gene data into database (Step 3)
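
To give a feel for what Step 1 involves, the sketch below pulls (MeSH descriptor, PMID) pairs out of a tiny PubMed-style XML fragment with the standard library. It is an illustration only and not the code in 1_extract_mesh_pubmed.py; the sample record, its PMID, and the function name `mesh_pmid_pairs` are assumptions. For the real .xml.gz files you would likely open them with `gzip.open` and stream with `xml.etree.ElementTree.iterparse` rather than load whole files.

```python
import xml.etree.ElementTree as ET

# Invented sample record using the PubMed element names
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def mesh_pmid_pairs(xml_text):
    """Yield (descriptor name, PMID) for every MeSH heading in the XML text."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        path = "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        for descriptor in article.iterfind(path):
            yield descriptor.text, pmid


pairs = list(mesh_pmid_pairs(SAMPLE))
print(pairs)
```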

You can check setup status at any time:

python v1/run_analysis.py --check-setup

Generate a Network

Once setup is complete, you can generate networks in two ways:

Option 1: MeSH-based Network (Topic-focused)

python v1/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

Option 2: Gene-based Network (Gene-centered)

python v1/build_gene_interaction_network.py \
    --gene IL17A \
    --year-start 2014 \
    --year-end 2024 \
    --organism map_to_human \
    --min-papers-per-gene 5 \
    --min-papers-per-edge 3

Visualize Results

Both network types produce compatible JSON output. Generate visualizations:

# Generate interactive HTML
python v1/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json

# Generate DOT file for Graphviz
python v1/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
dot -Tsvg network.dot -o network.svg
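
For readers unfamiliar with DOT, here is a minimal sketch of how a weighted, undirected co-citation edge list maps onto the DOT language. It does not reproduce what 5_export_dot.py writes; the function name `to_dot`, the graph name, and the choice of attributes are assumptions.

```python
def to_dot(edges, name="cocitation"):
    """Render {(gene_a, gene_b): weight} as an undirected Graphviz graph."""
    lines = [f"graph {name} {{"]
    for (a, b), weight in sorted(edges.items()):
        # Quote node IDs so symbols with dashes or dots stay valid DOT
        lines.append(f'  "{a}" -- "{b}" [weight={weight}, label="{weight}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot({("TNF", "TP53"): 2, ("IL17A", "TNF"): 1})
print(dot_text)
```

The printed text can be saved to a .dot file and rendered with the `dot` command shown above.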

Usage Modes

MeSH-based Networks

Starting from a disease or topic:

Input: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
Output: Network of genes co-cited in papers tagged with that MeSH term
Use case: Topic-focused research, disease-gene associations

Gene-based Networks

Starting from a single gene:

Input: Gene symbol (e.g., "IL17A", "TNF", "TP53")
Output: Network of genes co-cited with the starting gene
Use case: Gene-centered analysis, finding interaction partners

Key Parameters

Common Parameters

--year-start YEAR - Start year for publication filter
--year-end YEAR - End year for publication filter
--organism MODE - Organism filtering mode:
- human_only: Only human genes (tax_id=9606)
- map_to_human: Include all species, map to human orthologs (default)
--min-papers-per-gene N - Minimum publications per gene (default: 2-5)
--min-papers-per-edge N - Minimum shared publications for edge (default: 2-3)
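
The two thresholds can be thought of as a node filter followed by an edge filter. The sketch below shows one plausible way to apply them; the order of operations inside the actual scripts is not documented here, so treat the function `apply_thresholds` and its toy inputs as assumptions.

```python
def apply_thresholds(gene_papers, edge_weights, min_papers_per_gene, min_papers_per_edge):
    """Keep genes with enough papers, then edges with enough shared papers."""
    kept_genes = {
        gene for gene, pmids in gene_papers.items()
        if len(pmids) >= min_papers_per_gene
    }
    kept_edges = {
        (a, b): weight
        for (a, b), weight in edge_weights.items()
        # An edge survives only if it is strong enough and both endpoints survived
        if weight >= min_papers_per_edge and a in kept_genes and b in kept_genes
    }
    return kept_genes, kept_edges


# Toy data (invented): gene -> set of PMIDs, and pair -> shared paper count
gene_papers = {"TP53": {1, 2, 3}, "TNF": {1, 2, 4}, "IL17A": {2}}
edge_weights = {("TNF", "TP53"): 2, ("IL17A", "TNF"): 1, ("IL17A", "TP53"): 1}

genes, edges = apply_thresholds(gene_papers, edge_weights, 3, 2)
print(sorted(genes), edges)
```

Raising either threshold shrinks the network, which is why increasing them is the usual remedy for oversized or slow runs.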

MeSH-specific

--mesh TERM - MeSH term to analyze (required)

Gene-specific

--gene SYMBOL - Starting gene symbol (required)
--exclude-seed - Exclude starting gene from network (optional)
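
As a rough illustration of the gene-centered mode and the effect of --exclude-seed, the sketch below collects every gene that shares at least one paper with a seed gene. It is a simplified stand-in for build_gene_interaction_network.py, not a copy of it; `seed_network` and the toy mapping are assumptions.

```python
def seed_network(gene_to_pmids, seed, exclude_seed=False):
    """Return (nodes, shared_counts) for genes co-cited with the seed gene."""
    seed_pmids = gene_to_pmids[seed]
    shared_counts = {}
    for gene, pmids in gene_to_pmids.items():
        if gene == seed:
            continue
        n_shared = len(pmids & seed_pmids)  # papers citing both genes
        if n_shared > 0:
            shared_counts[gene] = n_shared
    nodes = set(shared_counts)
    if not exclude_seed:
        nodes.add(seed)
    return nodes, shared_counts


# Toy data (invented): gene symbol -> set of PMIDs
toy = {"IL17A": {1, 2, 3}, "TNF": {2, 3, 9}, "IL23A": {3}, "TP53": {7}}

nodes, shared = seed_network(toy, "IL17A")
nodes_no_seed, _ = seed_network(toy, "IL17A", exclude_seed=True)
print(sorted(nodes), shared)
```

Excluding the seed can make the partner-to-partner structure easier to read, since the seed would otherwise connect to every node.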

Database Filtering Strategy

To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:

Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
PMID filtering: Extracts only PubMed articles that reference filtered genes (~1.27M PMIDs)
Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature

See docs/FILTERING_STRATEGY.md for detailed information.
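
The sketch below shows the idea behind ortholog mapping and the two --organism modes. It assumes rows shaped like the columns of NCBI's gene_orthologs file (tax_id, GeneID, relationship, Other_tax_id, Other_GeneID); the function names and the single example row are illustrative, and the pipeline's real logic may differ.

```python
HUMAN_TAX_ID = "9606"


def ortholog_to_human(rows):
    """Map each non-human GeneID to its human ortholog's GeneID."""
    mapping = {}
    for tax_id, gene_id, _relationship, other_tax_id, other_gene_id in rows:
        # Handle either orientation of the human/non-human pair
        if tax_id == HUMAN_TAX_ID and other_tax_id != HUMAN_TAX_ID:
            mapping[other_gene_id] = gene_id
        elif other_tax_id == HUMAN_TAX_ID and tax_id != HUMAN_TAX_ID:
            mapping[gene_id] = other_gene_id
    return mapping


def resolve(gene_id, tax_id, mapping, mode="map_to_human"):
    """Return the human GeneID to use for a record, or None to drop it."""
    if tax_id == HUMAN_TAX_ID:
        return gene_id
    if mode == "map_to_human":
        return mapping.get(gene_id)  # None when no human ortholog is known
    return None  # human_only: discard non-human genes


# Example row: human TP53 (7157) and its mouse ortholog Trp53 (22059)
rows = [("9606", "7157", "Ortholog", "10090", "22059")]
mapping = ortholog_to_human(rows)
print(resolve("22059", "10090", mapping))                # 7157
print(resolve("22059", "10090", mapping, "human_only"))  # None
```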

Output Files

Networks are saved to /results/{network_name}/:

network_data.json - Complete network data with metadata
nodes.csv - List of genes with paper counts
edges.csv - Network edges with shared paper counts
index.html - Interactive visualization (after running 6_generate_html.py)
network.dot - Graphviz file (after running 5_export_dot.py)

Documentation

Detailed documentation is available in the docs/ directory:

docs/README.md - Complete pipeline documentation
docs/USAGE_GUIDE.md - Quick usage guide for gene networks
docs/README_GENE_NETWORK.md - Gene network builder details
docs/FILTERING_STRATEGY.md - Database filtering strategy

Examples

See examples/example_usage.sh for various usage examples.

Example 1: Chordoma Network (Last 10 Years)

python v1/run_analysis.py \
    --mesh "Chordoma" \
    --year-start 2014 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 3 \
    --min-papers-per-edge 2

Example 2: IL17A Interaction Network

python v1/build_gene_interaction_network.py \
    --gene IL17A \
    --year-start 2014 \
    --year-end 2024 \
    --organism map_to_human \
    --min-papers-per-gene 5 \
    --min-papers-per-edge 3

Example 3: TNF Network (Recent, Human Only)

python v1/build_gene_interaction_network.py \
    --gene TNF \
    --year-start 2020 \
    --year-end 2024 \
    --organism human_only \
    --min-papers-per-gene 10 \
    --min-papers-per-edge 5

Requirements

Python 3.6+
Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
Optional: tqdm (for progress bars during setup)

See requirements.txt for complete list.

Data Requirements

Required Input Data

PubMed XML files: PubMed baseline/updatefiles snapshot
- Download from: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
- Format: Compressed XML files (.xml.gz)
- Size: more than 1,000 files, several hundred GB
- Expected location: /data/pubmed-baseline/ (or configure in v1/utils.py)
- Contents: Complete PubMed article metadata including MeSH terms, PMIDs, titles, and publication years
- Alternative: Use updatefiles from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ for more recent data
NCBI Gene data files: Downloaded automatically by setup script
- gene2pubmed.gz - Gene to PubMed mappings
- gene_info.gz - Gene symbols and names
- gene_orthologs.gz - Ortholog relationships
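
gene2pubmed is a tab-separated file with the columns tax_id, GeneID, and PubMed_ID. The sketch below shows one way to read it into a gene-to-PMIDs mapping; it is not the parser in 3_parse_gene_data.py, and the function name and sample rows are assumptions.

```python
import io
from collections import defaultdict


def read_gene2pubmed(handle, tax_ids=None):
    """Build {GeneID: set of PMIDs} from gene2pubmed-style lines."""
    gene_to_pmids = defaultdict(set)
    for line in handle:
        if line.startswith("#"):  # header line
            continue
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")
        if tax_ids is None or tax_id in tax_ids:
            gene_to_pmids[gene_id].add(int(pmid))
    return dict(gene_to_pmids)


# Invented sample rows in the gene2pubmed layout
sample = io.StringIO(
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t7157\t111\n"
    "9606\t7157\t222\n"
    "10090\t22059\t333\n"
)
human = read_gene2pubmed(sample, tax_ids={"9606"})
print(human)

# For the downloaded file, something like:
#   import gzip
#   with gzip.open("gene2pubmed.gz", "rt") as fh:
#       mapping = read_gene2pubmed(fh)
```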

Created Database Files

Setup creates these files in /data/:

human_centric_genes.txt - Filtered gene IDs (92.5 MB)
human_centric_pmids.txt - Filtered PubMed IDs (10.8 MB)
mesh_pubmed.db - SQLite: MeSH→PMID, years, titles
gene_pubmed.db - SQLite: Gene→PMID, orthologs
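
The following sketch shows how a MeSH term plus year range could be resolved to PMIDs with SQLite. The table and column names (`mesh_pmid`, `pmid_year`, `mesh_term`, `year`) are hypothetical; the actual schema of mesh_pubmed.db is not described in this README, so inspect the file (for example with `.schema` in the sqlite3 shell) before adapting it.

```python
import sqlite3

# In-memory database with a hypothetical schema and invented rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mesh_pmid (mesh_term TEXT, pmid INTEGER);
CREATE TABLE pmid_year (pmid INTEGER PRIMARY KEY, year INTEGER);
""")
conn.executemany(
    "INSERT INTO mesh_pmid VALUES (?, ?)",
    [("Chordoma", 1), ("Chordoma", 2), ("Psoriasis", 3)],
)
conn.executemany(
    "INSERT INTO pmid_year VALUES (?, ?)",
    [(1, 2010), (2, 2018), (3, 2020)],
)


def pmids_for_mesh(conn, term, year_start, year_end):
    """Return PMIDs tagged with `term` and published within the year range."""
    rows = conn.execute(
        "SELECT m.pmid FROM mesh_pmid AS m "
        "JOIN pmid_year AS y ON y.pmid = m.pmid "
        "WHERE m.mesh_term = ? AND y.year BETWEEN ? AND ? "
        "ORDER BY m.pmid",
        (term, year_start, year_end),
    )
    return [row[0] for row in rows]


hits = pmids_for_mesh(conn, "Chordoma", 2014, 2024)
print(hits)  # [2]
```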

Performance

Setup time: ~2-3 hours (one-time, with filtering)
Network generation: Minutes to hours depending on parameters
Database size: Reduced by ~97% with human-centric filtering

Tips for Best Results

Start broad, then narrow: Begin with default parameters, then increase thresholds to reduce network size
Use year filters: Recent papers may show emerging interactions
Compare organism modes: map_to_human provides more coverage, human_only is more specific
Adjust thresholds: For highly studied genes/topics, increase thresholds to focus on strongest interactions
Visualize iteratively: Generate HTML first for quick viewing, then create publication-quality DOT graphics

Common Taxonomy IDs

Human: 9606
Mouse: 10090
Rat: 10116
Zebrafish: 7955
Fly (D. melanogaster): 7227
Worm (C. elegans): 6239
Yeast (S. cerevisiae): 4932

Troubleshooting

Missing database files: Ensure setup completed successfully with python v1/run_analysis.py --check-setup

No results: Try relaxing filters (lower thresholds) or check MeSH term/gene symbol spelling

Memory issues: Process smaller date ranges or raise --min-papers-per-gene and --min-papers-per-edge

Gene symbol not found: Check spelling and case (usually uppercase). The script will suggest similar symbols if found.
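
If you want to check a symbol yourself before running the pipeline, a fuzzy lookup can be sketched with the standard library's difflib. This is not necessarily how the script produces its suggestions; `suggest_symbols`, the cutoff value, and the symbol list are assumptions.

```python
import difflib


def suggest_symbols(query, known_symbols, n=3):
    """Return up to n known symbols that closely match the query, ignoring case."""
    by_upper = {symbol.upper(): symbol for symbol in known_symbols}
    matches = difflib.get_close_matches(query.upper(), list(by_upper), n=n, cutoff=0.6)
    return [by_upper[match] for match in matches]


suggestions = suggest_symbols("il17", ["IL17A", "IL17F", "TNF", "TP53"])
print(suggestions)
```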

References

PubMed DTD: https://wayback.archive-it.org/org-350/20240424204414/https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
NCBI Gene FTP: https://ftp.ncbi.nih.gov/gene/DATA/
DOT Language: https://graphviz.org/doc/info/lang.html
PubMed Baseline: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

Contributing

Contributions are welcome! Please ensure:

Code follows existing style and structure
Documentation is updated for new features
Test scripts are provided for significant changes

License

[Specify your license here]

Citation

If you use this tool in your research, please cite: [Add citation information if applicable]

Contact

[Add contact information or link to issues page]

Acknowledgments

This tool uses data from:

NCBI PubMed
NCBI Gene database
MeSH (Medical Subject Headings)
