Build gene co-citation networks from PubMed literature based on MeSH terms, with support for date range filtering and organism-specific analysis including ortholog mapping.
This pipeline creates networks where:
- Nodes are genes
- Edges connect genes that appear in the same publications (co-citations)
- Edge weights represent the number of shared publications
- Human-centric filtering: Focuses on human genes and genes with human orthologs
- Extract MeSH term to PubMed mappings from local PubMed XML files
- Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
- Filter by publication year range
- Filter by organism (human-only or map orthologs to human genes)
- Apply minimum thresholds for papers per gene and papers per edge
- Export to DOT format for visualization
- Generate interactive HTML pages with links to PubMed
To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:
- Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
- PMID filtering: Only extracts PubMed articles that reference filtered genes (~1.27M PMIDs)
- Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature
See FILTERING_STRATEGY.md for detailed information.
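The filtering idea above can be sketched in a few lines. This is only an illustration, not the code in 0_build_gene_pmid_filter.py: the function names and the in-memory inputs (a gene-to-tax_id dict, ortholog pairs, and gene/PMID rows) are assumptions made for the example.

```python
HUMAN_TAX_ID = 9606  # NCBI taxonomy ID for Homo sapiens


def build_human_centric_genes(gene_tax, ortholog_pairs):
    """Return human genes plus any gene paired with a human gene.

    gene_tax: dict mapping gene_id -> tax_id (hypothetical input shape)
    ortholog_pairs: iterable of (gene_id, other_gene_id) tuples
    """
    human = {gene for gene, tax in gene_tax.items() if tax == HUMAN_TAX_ID}
    keep = set(human)
    for gene_a, gene_b in ortholog_pairs:
        # An ortholog pair is kept when either side is a human gene.
        if gene_a in human:
            keep.add(gene_b)
        if gene_b in human:
            keep.add(gene_a)
    return keep


def build_filtered_pmids(gene_pmid_rows, keep_genes):
    """Return the PMIDs referenced by at least one kept gene."""
    return {pmid for gene_id, pmid in gene_pmid_rows if gene_id in keep_genes}
```

Articles whose PMID is not in the resulting set can then be skipped during XML extraction, which is where most of the size reduction would come from.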
Use the all-in-one run_analysis.py script for easy setup and analysis:
# First time: Run setup (creates databases)
python /code/run_analysis.py --setup
# Generate a network
python /code/run_analysis.py --mesh "Chordoma" --year-start 2014 --year-end 2024
# With custom parameters
python /code/run_analysis.py --mesh "Breast Neoplasms" \
--year-start 2010 --year-end 2023 \
--organism map_to_human \
--min-papers-per-gene 3 \
--min-papers-per-edge 2
# Check if setup is complete
python /code/run_analysis.py --check-setup
Master Script Options:
- --setup - Run initial database setup (one-time)
- --force-setup - Force re-create databases
- --check-setup - Verify setup status
- --mesh TERM - MeSH term to analyze
- --year-start YEAR - Start year
- --year-end YEAR - End year
- --organism MODE - human_only or map_to_human
- --min-papers-per-gene N - Minimum papers per gene (default: 2)
- --min-papers-per-edge N - Minimum shared papers (default: 2)
Create human-centric filter files for optimized processing.
python /code/0_build_gene_pmid_filter.py
Input: Gene data files from NCBI
Output:
- /data/human_centric_genes.txt - Filtered gene IDs
- /data/human_centric_pmids.txt - Filtered PubMed IDs
Note: This is automatically run by run_analysis.py --setup
Parse PubMed XML files and create a database of MeSH terms, PMIDs, years, and titles. PMID filtering is applied: only articles referenced by the filtered gene set are extracted.
python /code/1_extract_mesh_pubmed.py
Input: /data/pubmed-Dec-2023/*.xml.gz
Output: /data/mesh_pubmed.db (SQLite database)
Time: ~2-3 hours for ~1,167 XML files (with filtering)
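For orientation, here is a minimal sketch of pulling PMID, year, title, and MeSH descriptors out of PubMed XML with the standard library. It is not the code in 1_extract_mesh_pubmed.py: the function name, the record layout, and the sample document (including its PMID) are made up for illustration, and the real script may handle more cases, such as articles that carry a MedlineDate instead of a Year.

```python
import xml.etree.ElementTree as ET

# Tiny made-up document that follows the PubMed element layout.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>Example title</ArticleTitle>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_mesh_records(xml_text, keep_pmids=None):
    """Return one dict per article; skip PMIDs outside keep_pmids if given."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        pmid_text = citation.findtext("PMID")
        if pmid_text is None:
            continue
        pmid = int(pmid_text)
        if keep_pmids is not None and pmid not in keep_pmids:
            continue
        # findtext returns only the text before any inline markup in the title.
        title = citation.findtext("Article/ArticleTitle", default="")
        year_text = citation.findtext("Article/Journal/JournalIssue/PubDate/Year")
        year = int(year_text) if year_text else None
        terms = [
            node.text
            for node in citation.findall("MeshHeadingList/MeshHeading/DescriptorName")
        ]
        records.append({"pmid": pmid, "year": year, "title": title, "mesh": terms})
    return records
```

For the real multi-gigabyte files, a streaming parser such as ET.iterparse over gzip.open(...) would likely be needed instead of loading a whole file as a string.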
Download gene-related files from NCBI FTP.
python /code/2_download_gene_data.py
Downloads:
- gene2pubmed.gz - Gene to PubMed mappings (~237 MB)
- gene_info.gz - Gene symbols and names (~1.3 GB)
- gene_orthologs.gz - Ortholog relationships (~107 MB)
Output: Files saved to /data/
Use --force to re-download existing files.
Parse downloaded gene files and populate the database. Gene filtering is applied: only human-centric genes are stored.
python /code/3_parse_gene_data.py
Input: /data/gene*.gz files
Output: /data/gene_pubmed.db (SQLite database)
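As a rough illustration of this step, the sketch below parses gene2pubmed rows (tab-separated tax_id, GeneID, PubMed_ID, with a header line starting with #) and loads them into SQLite. The function names and the gene_pubmed table layout are assumptions for the example; the actual schema used by 3_parse_gene_data.py may differ.

```python
import sqlite3


def parse_gene2pubmed(lines, keep_genes=None):
    """Yield (tax_id, gene_id, pmid) tuples from gene2pubmed text lines."""
    for line in lines:
        # Skip the "#tax_id GeneID PubMed_ID" header and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")[:3]
        tax_id, gene_id, pmid = (int(field) for field in fields)
        if keep_genes is None or gene_id in keep_genes:
            yield tax_id, gene_id, pmid


def load_into_sqlite(conn, rows):
    """Insert parsed rows into a hypothetical gene_pubmed table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_pubmed "
        "(tax_id INTEGER, gene_id INTEGER, pmid INTEGER)"
    )
    conn.executemany("INSERT INTO gene_pubmed VALUES (?, ?, ?)", rows)
    conn.commit()
```

With the real file, the lines would come from gzip.open("/data/gene2pubmed.gz", "rt"), and keep_genes from the filtered gene ID list.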
Construct the gene co-citation network with your parameters.
python /code/4_build_network.py \
--mesh "Chordoma" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 2 \
--min-papers-per-edge 2
Parameters:
- --mesh: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
- --year-start: Start year for publications
- --year-end: End year for publications
- --organism: Organism mode
  - human_only: Only human genes (tax_id=9606)
  - map_to_human: Include all species, map to human orthologs
- --min-papers-per-gene: Minimum publications per gene (default: 2)
- --min-papers-per-edge: Minimum shared publications for edge (default: 2)
Output: /results/{mesh}_{year_start}_{year_end}_{organism}/network_data.json
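The co-citation logic and the two thresholds can be sketched as follows. This assumes the gene-to-PMID mapping has already been restricted to the chosen MeSH term and year range; the function name, the input shape, and the order in which the thresholds are applied are assumptions, not a description of 4_build_network.py.

```python
from collections import defaultdict
from itertools import combinations


def build_cocitation_network(gene_pmids, min_papers_per_gene=2, min_papers_per_edge=2):
    """Return (nodes, edges) for a co-citation network.

    gene_pmids: dict mapping gene -> set of PMIDs (hypothetical input shape)
    nodes: dict gene -> set of PMIDs, for genes meeting the per-gene minimum
    edges: dict (gene_a, gene_b) -> set of shared PMIDs, meeting the per-edge minimum
    """
    nodes = {
        gene: set(pmids)
        for gene, pmids in gene_pmids.items()
        if len(pmids) >= min_papers_per_gene
    }

    # Invert the mapping so each paper lists the genes it mentions.
    pmid_genes = defaultdict(set)
    for gene, pmids in nodes.items():
        for pmid in pmids:
            pmid_genes[pmid].add(gene)

    # Every pair of genes in the same paper shares that paper.
    shared = defaultdict(set)
    for pmid, genes in pmid_genes.items():
        for gene_a, gene_b in combinations(sorted(genes), 2):
            shared[(gene_a, gene_b)].add(pmid)

    edges = {
        pair: pmids for pair, pmids in shared.items() if len(pmids) >= min_papers_per_edge
    }
    return nodes, edges
```

The edge weight is then len(edges[pair]), the number of shared publications. Iterating per paper rather than per gene pair avoids comparing every gene against every other gene.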
Generate DOT language file for network visualization.
python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
Output: network.dot in the same directory
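To show what a DOT export might look like, here is a small sketch that turns a network dictionary into an undirected DOT graph. The structure of network_data.json is not documented here, so the "nodes" and "edges" keys and their fields (id, symbol, source, target, weight) are assumptions made for the example.

```python
def network_to_dot(network):
    """Render an assumed network dict as an undirected DOT graph string."""
    lines = ["graph cocitation {"]
    for node in network["nodes"]:
        # Use the gene ID as the DOT node name and the symbol as its label.
        lines.append('  "{}" [label="{}"];'.format(node["id"], node["symbol"]))
    for edge in network["edges"]:
        # Undirected edges use "--"; the shared-paper count is shown as the label.
        lines.append(
            '  "{}" -- "{}" [weight={}, label="{}"];'.format(
                edge["source"], edge["target"], edge["weight"], edge["weight"]
            )
        )
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be rendered with Graphviz, for example with dot -Tsvg network.dot -o network.svg.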
Create interactive HTML visualization with clickable nodes and edges.
python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json
Output: Multiple HTML files:
- index.html - Main page with network overview
- gene_{id}.html - Papers for each gene
- edge_{id1}_{id2}.html - Shared papers for each edge
All PubMed IDs link to https://pubmed.ncbi.nlm.nih.gov/
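A PubMed link for a given PMID can be built as in the sketch below. The helper name is hypothetical and the markup produced by 6_generate_html.py may look different; the example only shows the URL pattern and HTML escaping of the title.

```python
from html import escape

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_link(pmid, title):
    """Return an HTML anchor pointing at the PubMed page for pmid."""
    # int() guards against non-numeric IDs; escape() protects the title text.
    return '<a href="{}{}/">{}</a>'.format(PUBMED_BASE, int(pmid), escape(title))
```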
Complete example using the all-in-one script:
# First time: Run setup (creates databases)
python /code/run_analysis.py --setup
# Generate a Chordoma network (last 10 years, human only)
python /code/run_analysis.py \
--mesh "Chordoma" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 3 \
--min-papers-per-edge 2
# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser
If you prefer more control, run individual scripts:
# One-time setup (steps 1-3)
python /code/1_extract_mesh_pubmed.py
python /code/2_download_gene_data.py
python /code/3_parse_gene_data.py
# Build network (steps 4-6)
python /code/4_build_network.py \
--mesh "Chordoma" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 3 \
--min-papers-per-edge 2
python /code/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
python /code/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json
# View results
# Open /results/Chordoma_2014_2024_human_only/index.html in browser
/code/ # Python scripts
├── run_analysis.py # Master script (all-in-one)
├── utils.py # Helper functions
├── 0_build_gene_pmid_filter.py # Build filtering sets
├── 1_extract_mesh_pubmed.py # Parse PubMed XML (with filtering)
├── 2_download_gene_data.py # Download NCBI data
├── 3_parse_gene_data.py # Parse gene files (with filtering)
├── 4_build_network.py # Build co-citation network
├── 5_export_dot.py # Export to DOT format
├── 6_generate_html.py # Generate HTML pages
├── FILTERING_STRATEGY.md # Filtering documentation
└── README.md # This file
/data/ # Input data and databases
├── pubmed-Dec-2023/ # PubMed XML files (provided)
├── gene2pubmed.gz # Downloaded from NCBI
├── gene_info.gz # Downloaded from NCBI
├── gene_orthologs.gz # Downloaded from NCBI
├── human_centric_genes.txt # Filtered gene IDs
├── human_centric_pmids.txt # Filtered PubMed IDs
├── mesh_pubmed.db # SQLite: MeSH→PMID, years, titles (filtered)
└── gene_pubmed.db # SQLite: Gene→PMID, orthologs (filtered)
/scratch/ # Temporary files
└── (intermediate processing files)
/results/ # Output networks
└── {mesh}_{year_start}_{year_end}_{organism}/
├── network_data.json # Network data
├── network.dot # DOT format graph
├── index.html # Main visualization
├── gene_{id}.html # Per-gene pages
└── edge_{id1}_{id2}.html # Per-edge pages
Only includes genes from Homo sapiens (tax_id=9606).
Includes genes from all species but maps them to human orthologs:
- Human genes are used directly
- Non-human genes are mapped to their human orthologs via the gene_orthologs database
- If a non-human gene has multiple human orthologs, papers are assigned to all
- Network nodes are human genes, but may include papers from model organisms
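The mapping rules above can be sketched as follows. The function name and input shapes are assumptions for illustration, and dropping non-human genes that have no human ortholog is an assumption that follows from the nodes being human genes; the real lookup goes through the gene_orthologs data.

```python
from collections import defaultdict

HUMAN_TAX_ID = 9606


def map_to_human(gene_pmids, gene_tax, human_orthologs):
    """Reassign papers from non-human genes to their human orthologs.

    gene_pmids: dict gene_id -> set of PMIDs
    gene_tax: dict gene_id -> tax_id
    human_orthologs: dict non-human gene_id -> list of human gene_ids
    """
    mapped = defaultdict(set)
    for gene, pmids in gene_pmids.items():
        if gene_tax.get(gene) == HUMAN_TAX_ID:
            targets = [gene]  # human genes are used directly
        else:
            # Papers go to every human ortholog; genes without one are dropped.
            targets = human_orthologs.get(gene, [])
        for target in targets:
            mapped[target] |= set(pmids)
    return dict(mapped)
```

Because a paper is assigned to all orthologs, a single model-organism paper can contribute to several human gene nodes.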
- Human: 9606
- Mouse: 10090
- Rat: 10116
- Zebrafish: 7955
- Fly (D. melanogaster): 7227
- Worm (C. elegans): 6239
- Yeast (S. cerevisiae): 4932
- Python 3.6+
- Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
- Optional: tqdm (for progress bars)
- Setup (Steps 0-3) only needs to be run once
- Steps 4-6 can be run multiple times with different parameters
- PubMed data is from December 2023 snapshot
- Filtering optimization: Database focuses on human genes and genes with human orthologs
- Processing time: Setup takes ~2-3 hours (with filtering enabled)
- Use filters to focus on high-confidence associations
- Missing database files: Ensure steps 1-3 completed successfully
- No results: Try relaxing filters (lower thresholds) or check MeSH term spelling
- Memory issues: Process smaller date ranges or increase min-papers filters
# First time setup
python /code/run_analysis.py --setup
# Check setup status
python /code/run_analysis.py --check-setup
# Generate network (basic)
python /code/run_analysis.py --mesh "Disease Name" --year-start 2014 --year-end 2024
# Generate network (all options; --organism accepts human_only or map_to_human)
python /code/run_analysis.py \
--mesh "Disease Name" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 2 \
--min-papers-per-edge 2
# View help
python /code/run_analysis.py --help