Build co-citation networks from PubMed literature with support for genes, cell types, and drugs. Analyze disease-entity relationships based on MeSH terms with date range filtering and organism-specific analysis including ortholog mapping.
This repository contains two versions of the pipeline:
Version 2: Multi-entity analysis
- Nodes: Genes, Cell Types (Cell Ontology), and Drugs (DrugBank)
- Edges: Co-citation relationships between any entity types
- Features: MeSH-to-Cell Ontology mapping, drug target genes, enhanced HTML visualizations
- Use case: Comprehensive disease analysis including cellular and therapeutic context
- Documentation: v2/README.md
Version 1: Gene-only analysis (legacy)
- Nodes: Genes only
- Edges: Gene co-citation relationships
- Features: MeSH-based or gene-centered networks, human-centric filtering
- Use case: Gene-focused interaction analysis
- Documentation: See below for v1 documentation
See the v2/README.md for complete documentation.
# Build databases (one-time setup)
cd v2
bash build_databases.sh
# Run multi-entity analysis
bash run_multi_entity_analysis.sh "Psoriasis" 2014 2024
Continue reading below for v1 documentation.
This pipeline creates networks where:
- Nodes are genes
- Edges connect genes that appear in the same publications (co-citations)
- Edge weights represent the number of shared publications
- Human-centric filtering: Focuses on human genes and genes with human orthologs
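The edge-weight definition above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory mapping from PMID to the gene symbols mentioned in that paper; the function name `cocitation_edges` and the toy data are illustrative and are not taken from the pipeline's scripts.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(pmid_to_genes):
    """Count shared publications for every unordered gene pair."""
    edge_weights = Counter()
    for genes in pmid_to_genes.values():
        # Deduplicate and sort so each pair is always stored in the same order.
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            edge_weights[(gene_a, gene_b)] += 1
    return edge_weights


# Toy input (made-up PMIDs): PMID -> gene symbols mentioned in that paper.
papers = {
    101: ["IL17A", "TNF"],
    102: ["IL17A", "TNF", "IL23A"],
    103: ["TNF"],
}
edges = cocitation_edges(papers)
print(edges[("IL17A", "TNF")])  # 2 shared publications
```

A paper that mentions only one gene contributes no edges, which is why PMID 103 does not change any weight here.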
- Extract MeSH term to PubMed mappings from local PubMed XML files
- Download and parse NCBI gene data (gene2pubmed, gene_info, gene_orthologs)
- Build networks from MeSH terms OR starting genes
- Filter by publication year range
- Filter by organism (human-only or map orthologs to human genes)
- Apply minimum thresholds for papers per gene and papers per edge
- Export to DOT format for visualization with Graphviz
- Generate interactive HTML pages with links to PubMed
pubmed-gene-network/
├── README.md # This file
├── requirements.txt # Python dependencies (v1)
├── LICENSE # License information
├── v2/ # Version 2: Multi-entity analysis ⭐
│ ├── README.md # v2 documentation
│ ├── WORKFLOW.md # Pipeline workflow guide
│ ├── USAGE_GUIDE.md # Quick usage guide
│ ├── MULTI_ENTITY_BUILD_GUIDE.md # Database build guide
│ ├── build_databases.sh # One-time database setup
│ ├── run_multi_entity_analysis.sh # Run analysis
│ ├── 0_build_mesh_cell_mapping.py # MeSH to Cell Ontology
│ ├── 1_download_gene_data.py # Download NCBI data
│ ├── 2_parse_gene_data.py # Parse gene data
│ ├── 3_extract_mesh_pubmed_database.py # Extract MeSH-PMID
│ ├── 4_build_cell_database.py # Build cell type database
│ ├── 5_build_drug_database.py # Build drug database
│ ├── 6_query_multi_entity_network.py # Query network
│ ├── 7_export_dot.py # Export to DOT format
│ ├── 8_generate_html.py # Generate HTML visualization
│ └── utils.py # Shared utilities
├── v1/ # Version 1: Gene-only analysis (legacy)
│ ├── utils.py # Shared utility functions
│ ├── run_analysis.py # Master script (recommended)
│ ├── 0_build_gene_pmid_filter.py # Build human-centric filters
│ ├── 1_extract_mesh_pubmed.py # Extract MeSH-PMID mappings
│ ├── 2_download_gene_data.py # Download NCBI gene data
│ ├── 3_parse_gene_data.py # Parse gene data to database
│ ├── 4_build_network.py # Build MeSH-based network
│ ├── 5_export_dot.py # Export network to DOT format
│ ├── 6_generate_html.py # Generate HTML visualization
│ ├── build_gene_interaction_network.py # Build gene-based network
│ └── query_notochord.py # Query genes for specific terms
├── docs/ # v1 Documentation
│ ├── README.md # Original detailed documentation
│ ├── USAGE_GUIDE.md # Quick usage guide
│ ├── README_GENE_NETWORK.md # Gene network documentation
│ └── FILTERING_STRATEGY.md # Database filtering strategy
└── examples/ # v1 Example scripts
  └── example_usage.sh # Example commands
- Clone or download this repository
- Install dependencies:
  pip install -r requirements.txt
- Download PubMed baseline XML files:
  - Visit: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
  - Download all XML files (.xml.gz format)
  - Place them in /data/pubmed-baseline/ or your preferred location
  - Note: This is a large download (~1000+ files, several hundred GB)
  - You can also use the updatefiles if you want more recent data
- Create required directories:
  mkdir -p /data /results /scratch
- Update the PubMed XML path in v1/utils.py if you placed files in a different location
Run the master script with --setup to create the databases:
python v1/run_analysis.py --setup
This will:
- Build gene and PMID filters (Step 0)
- Extract MeSH-PubMed mappings from XML files (Step 1, ~2-3 hours)
- Download NCBI gene data files (Step 2)
- Parse gene data into database (Step 3)
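To give a feel for what the MeSH extraction step involves, here is a minimal sketch that streams MeSH descriptor and PMID pairs out of PubMed XML using only the standard library. It is not the code in `1_extract_mesh_pubmed.py`; the function name `iter_mesh_pmid_pairs` is hypothetical and the sample record is made up. The element names follow the PubMed XML layout (`PubmedArticle`, `MedlineCitation`, `MeshHeading`, `DescriptorName`).

```python
import io
import xml.etree.ElementTree as ET


def iter_mesh_pmid_pairs(xml_stream):
    """Yield (pmid, mesh_descriptor) pairs from a PubMed XML stream."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        for descriptor in elem.iterfind(
            "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        ):
            yield pmid, descriptor.text
        elem.clear()  # release the article's subtree before parsing the next one


# Made-up sample record; real files would be opened with gzip.open(path, "rb").
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Chordoma</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

pairs = list(iter_mesh_pmid_pairs(io.BytesIO(SAMPLE)))
print(pairs)
```

Streaming with `iterparse` and clearing each article after use keeps memory roughly constant, which matters for a download of this size.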
You can check setup status at any time:
python v1/run_analysis.py --check-setup
Once setup is complete, you can generate networks in two ways:
python v1/run_analysis.py \
--mesh "Chordoma" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 3 \
--min-papers-per-edge 2
python v1/build_gene_interaction_network.py \
--gene IL17A \
--year-start 2014 \
--year-end 2024 \
--organism map_to_human \
--min-papers-per-gene 5 \
--min-papers-per-edge 3
Both network types produce compatible JSON output. Generate visualizations:
# Generate interactive HTML
python v1/6_generate_html.py /results/Chordoma_2014_2024_human_only/network_data.json
# Generate DOT file for Graphviz
python v1/5_export_dot.py /results/Chordoma_2014_2024_human_only/network_data.json
dot -Tsvg network.dot -o network.svg
Starting from a disease or topic:
- Input: MeSH term (e.g., "Chordoma", "Breast Neoplasms")
- Output: Network of genes co-cited in papers tagged with that MeSH term
- Use case: Topic-focused research, disease-gene associations
Starting from a single gene:
- Input: Gene symbol (e.g., "IL17A", "TNF", "TP53")
- Output: Network of genes co-cited with the starting gene
- Use case: Gene-centered analysis, finding interaction partners
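One way to picture a gene-centered network is sketched below. It assumes the network is built only from papers that mention the starting gene, which is an interpretation of the description above and not a confirmed detail of `build_gene_interaction_network.py`. The function name, the `exclude_seed` argument, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gene_centered_edges(pmid_to_genes, seed, exclude_seed=False):
    """Count co-citations among genes, using only papers that mention `seed`."""
    edges = Counter()
    for genes in pmid_to_genes.values():
        gene_set = set(genes)
        if seed not in gene_set:
            continue  # paper does not mention the starting gene
        if exclude_seed:
            gene_set.discard(seed)
        for pair in combinations(sorted(gene_set), 2):
            edges[pair] += 1
    return edges


# Toy input (made-up PMIDs): PMID -> gene symbols mentioned in that paper.
papers = {
    1: ["IL17A", "TNF"],
    2: ["IL17A", "TNF", "IL6"],
    3: ["TNF", "IL6"],  # ignored: IL17A is not mentioned
}
with_seed = gene_centered_edges(papers, "IL17A")
without_seed = gene_centered_edges(papers, "IL17A", exclude_seed=True)
print(with_seed[("IL17A", "TNF")], without_seed[("IL6", "TNF")])  # 2 1
```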
- --year-start YEAR - Start year for publication filter
- --year-end YEAR - End year for publication filter
- --organism MODE - Organism filtering mode:
  - human_only: Only human genes (tax_id=9606)
  - map_to_human: Include all species, map to human orthologs (default)
- --min-papers-per-gene N - Minimum publications per gene (default: 2-5)
- --min-papers-per-edge N - Minimum shared publications for edge (default: 2-3)
- --mesh TERM - MeSH term to analyze (required for MeSH-based networks)
- --gene SYMBOL - Starting gene symbol (required for gene-based networks)
- --exclude-seed - Exclude starting gene from network (optional, gene-based networks only)
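The difference between the two organism modes can be sketched as a small resolver. This is an illustration under assumptions: the function `resolve_gene`, the `ortholog_to_human` dictionary, and all gene IDs below are made up, and the pipeline's actual implementation may differ.

```python
HUMAN_TAX_ID = 9606


def resolve_gene(gene_id, tax_id, mode, ortholog_to_human):
    """Return the human gene ID to use for a record, or None to drop it."""
    if mode not in ("human_only", "map_to_human"):
        raise ValueError(f"unknown organism mode: {mode}")
    if tax_id == HUMAN_TAX_ID:
        return gene_id  # human genes pass through in both modes
    if mode == "human_only":
        return None  # non-human genes are discarded
    # map_to_human: keep the record only if a human ortholog is known.
    return ortholog_to_human.get(gene_id)


# Made-up IDs: non-human gene 2001 has human ortholog 1001; 2002 has none.
orthologs = {2001: 1001}
print(resolve_gene(2001, 10090, "human_only", orthologs))    # None
print(resolve_gene(2001, 10090, "map_to_human", orthologs))  # 1001
print(resolve_gene(2002, 10090, "map_to_human", orthologs))  # None
```

Under this reading, `map_to_human` folds papers about a mouse or rat gene into the node for its human counterpart, which is why it tends to give more coverage.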
To optimize performance and focus on relevant literature, the pipeline implements human-centric filtering:
- Gene filtering: Includes human genes (~194K) plus genes with human orthologs (~9.5M)
- PMID filtering: Only extracts PubMed articles that reference filtered genes (~1.27M PMIDs)
- Benefits: Reduces database size by ~97%, speeds up processing, maintains complete coverage of human gene literature
See docs/FILTERING_STRATEGY.md for detailed information.
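As a rough illustration of the gene filter, the sketch below collects human genes plus any gene that has a human ortholog. It assumes the five-column, tab-separated layout of NCBI's gene_orthologs file (tax_id, GeneID, relationship, Other_tax_id, Other_GeneID); the function name and the gene IDs are made up, and the real logic lives in `0_build_gene_pmid_filter.py`.

```python
import csv

HUMAN_TAX_ID = "9606"


def human_centric_gene_ids(ortholog_lines, human_gene_ids):
    """Return human gene IDs plus IDs of genes with a human ortholog."""
    keep = set(human_gene_ids)
    for row in csv.reader(ortholog_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        tax_id, gene_id, _relationship, other_tax_id, other_gene_id = row[:5]
        if tax_id == HUMAN_TAX_ID:
            keep.add(other_gene_id)  # ortholog of a human gene
        elif other_tax_id == HUMAN_TAX_ID:
            keep.add(gene_id)  # the pair is listed in the other direction
    return keep


# Made-up rows in the assumed layout.
rows = [
    "#tax_id\tGeneID\trelationship\tOther_tax_id\tOther_GeneID",
    "9606\t1001\tOrtholog\t10090\t2001",
    "10090\t3001\tOrtholog\t10116\t4001",  # no human side: ignored
]
kept = human_centric_gene_ids(rows, {"1001", "1002"})
print(sorted(kept))  # ['1001', '1002', '2001']
```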
Networks are saved to /results/{network_name}/:
- network_data.json - Complete network data with metadata
- nodes.csv - List of genes with paper counts
- edges.csv - Network edges with shared paper counts
- index.html - Interactive visualization (after running 6_generate_html.py)
- network.dot - Graphviz file (after running 5_export_dot.py)
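For readers unfamiliar with DOT, the following sketch shows how weighted co-citation edges could be written as an undirected Graphviz graph. It does not reproduce the output of `5_export_dot.py`; the function `to_dot`, the graph name, and the choice of attributes are assumptions made for illustration.

```python
def to_dot(edge_weights, graph_name="cocitation"):
    """Serialize {(gene_a, gene_b): shared_papers} as an undirected DOT graph."""
    lines = [f"graph {graph_name} {{"]
    for (gene_a, gene_b), weight in sorted(edge_weights.items()):
        # Quote node names so symbols with unusual characters stay valid DOT.
        lines.append(
            f'  "{gene_a}" -- "{gene_b}" [weight={weight}, label="{weight}"];'
        )
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot({("IL17A", "TNF"): 2, ("IL23A", "TNF"): 1})
print(dot_text)
```

The resulting text can be saved to a file and rendered with the `dot` command shown earlier.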
Detailed documentation is available in the docs/ directory:
- docs/README.md - Complete pipeline documentation
- docs/USAGE_GUIDE.md - Quick usage guide for gene networks
- docs/README_GENE_NETWORK.md - Gene network builder details
- docs/FILTERING_STRATEGY.md - Database filtering strategy
See examples/example_usage.sh for various usage examples.
python v1/run_analysis.py \
--mesh "Chordoma" \
--year-start 2014 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 3 \
--min-papers-per-edge 2
python v1/build_gene_interaction_network.py \
--gene IL17A \
--year-start 2014 \
--year-end 2024 \
--organism map_to_human \
--min-papers-per-gene 5 \
--min-papers-per-edge 3
python v1/build_gene_interaction_network.py \
--gene TNF \
--year-start 2020 \
--year-end 2024 \
--organism human_only \
--min-papers-per-gene 10 \
--min-papers-per-edge 5
- Python 3.6+
- Standard library modules (sqlite3, gzip, xml.etree.ElementTree, json, pathlib, etc.)
- Optional: tqdm (for progress bars during setup)
See requirements.txt for complete list.
- PubMed XML files: PubMed baseline/updatefiles snapshot
  - Download from: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
  - Format: Compressed XML files (.xml.gz)
  - Size: ~1000+ files, several hundred GB
  - Expected location: /data/pubmed-baseline/ (or configure in v1/utils.py)
  - Contents: Complete PubMed article metadata including MeSH terms, PMIDs, titles, and publication years
  - Alternative: Use updatefiles from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ for more recent data
- NCBI Gene data files: Downloaded automatically by setup script
  - gene2pubmed.gz - Gene to PubMed mappings
  - gene_info.gz - Gene symbols and names
  - gene_orthologs.gz - Ortholog relationships
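The gene-to-PubMed mapping can be read with the standard library alone. The sketch below assumes the three-column, tab-separated layout of gene2pubmed (tax_id, GeneID, PubMed_ID) with a `#` header line; the function name `load_gene2pubmed` and the sample rows are made up, and the in-memory gzip payload stands in for the downloaded file.

```python
import gzip
import io
from collections import defaultdict


def load_gene2pubmed(lines, tax_ids=None):
    """Build {gene_id: set of PMIDs}, optionally restricted to some tax IDs."""
    gene_to_pmids = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")
        if tax_ids is None or tax_id in tax_ids:
            gene_to_pmids[gene_id].add(pmid)
    return gene_to_pmids


# Made-up rows, compressed in memory to mimic gene2pubmed.gz.
raw = gzip.compress(
    b"#tax_id\tGeneID\tPubMed_ID\n"
    b"9606\t1001\t111\n"
    b"9606\t1001\t222\n"
    b"10090\t2001\t111\n"
)
with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as handle:
    human_only = load_gene2pubmed(handle, tax_ids={"9606"})
print(dict(human_only))
```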
Setup creates these files in /data/:
- human_centric_genes.txt - Filtered gene IDs (92.5 MB)
- human_centric_pmids.txt - Filtered PubMed IDs (10.8 MB)
- mesh_pubmed.db - SQLite: MeSH→PMID, years, titles
- gene_pubmed.db - SQLite: Gene→PMID, orthologs
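The SQLite databases make year-filtered lookups cheap. The sketch below shows the general idea against an in-memory database. The table and column names (`mesh_pmid`, `articles`, `mesh_term`, `year`) are hypothetical, since the real schema of mesh_pubmed.db is not described here, and the rows are made up.

```python
import sqlite3

# Hypothetical schema: the real mesh_pubmed.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mesh_pmid (mesh_term TEXT, pmid INTEGER);
    CREATE TABLE articles (pmid INTEGER PRIMARY KEY, year INTEGER, title TEXT);
    """
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, 2013, "a"), (2, 2015, "b"), (3, 2024, "c"), (4, 2020, "d")],
)
conn.executemany(
    "INSERT INTO mesh_pmid VALUES (?, ?)",
    [("Chordoma", 1), ("Chordoma", 2), ("Chordoma", 3), ("Psoriasis", 4)],
)


def pmids_for_mesh(connection, term, year_start, year_end):
    """Return PMIDs tagged with `term` and published within the year range."""
    rows = connection.execute(
        "SELECT m.pmid FROM mesh_pmid AS m "
        "JOIN articles AS a ON a.pmid = m.pmid "
        "WHERE m.mesh_term = ? AND a.year BETWEEN ? AND ? "
        "ORDER BY m.pmid",
        (term, year_start, year_end),
    ).fetchall()
    return [row[0] for row in rows]


result = pmids_for_mesh(conn, "Chordoma", 2014, 2024)
print(result)  # [2, 3]
```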
- Setup time: ~2-3 hours (one-time, with filtering)
- Network generation: Minutes to hours depending on parameters
- Database size: Significantly reduced with human-centric filtering (~97% reduction)
- Start broad, then narrow: Begin with default parameters, then increase thresholds to reduce network size
- Use year filters: Recent papers may show emerging interactions
- Compare organism modes: map_to_human provides more coverage, human_only is more specific
- Adjust thresholds: For highly studied genes/topics, increase thresholds to focus on strongest interactions
- Visualize iteratively: Generate HTML first for quick viewing, then create publication-quality DOT graphics
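The "start broad, then narrow" advice can be tried on a small scale with the sketch below, which applies both thresholds to made-up counts and reports how the network shrinks. The function `apply_thresholds` and its inputs are hypothetical and only mirror the meaning of `--min-papers-per-gene` and `--min-papers-per-edge`.

```python
def apply_thresholds(gene_papers, edge_papers, min_per_gene, min_per_edge):
    """Keep genes and edges that meet the minimum publication counts."""
    kept_genes = {g for g, n in gene_papers.items() if n >= min_per_gene}
    kept_edges = {
        (a, b): n
        for (a, b), n in edge_papers.items()
        # An edge survives only if it is strong enough and both ends survive.
        if n >= min_per_edge and a in kept_genes and b in kept_genes
    }
    return kept_genes, kept_edges


# Made-up counts: papers per gene and shared papers per gene pair.
gene_papers = {"TNF": 12, "IL17A": 6, "IL6": 3, "IL23A": 2}
edge_papers = {("IL17A", "TNF"): 5, ("IL6", "TNF"): 2, ("IL17A", "IL23A"): 2}

sizes = []
for min_gene, min_edge in [(2, 2), (3, 2), (5, 3)]:
    genes, edges = apply_thresholds(gene_papers, edge_papers, min_gene, min_edge)
    sizes.append((len(genes), len(edges)))
print(sizes)  # [(4, 3), (3, 2), (2, 1)]
```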
- Human: 9606
- Mouse: 10090
- Rat: 10116
- Zebrafish: 7955
- Fly (D. melanogaster): 7227
- Worm (C. elegans): 6239
- Yeast (S. cerevisiae): 4932
Missing database files: Ensure setup completed successfully with python v1/run_analysis.py --check-setup
No results: Try relaxing filters (lower thresholds) or check MeSH term/gene symbol spelling
Memory issues: Process smaller date ranges or increase min-papers filters
Gene symbol not found: Check spelling and case (usually uppercase). The script will suggest similar symbols if found.
- PubMed DTD: https://wayback.archive-it.org/org-350/20240424204414/https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
- NCBI Gene FTP: https://ftp.ncbi.nih.gov/gene/DATA/
- DOT Language: https://graphviz.org/doc/info/lang.html
- PubMed Baseline: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
Contributions are welcome! Please ensure:
- Code follows existing style and structure
- Documentation is updated for new features
- Test scripts are provided for significant changes
[Specify your license here]
If you use this tool in your research, please cite: [Add citation information if applicable]
[Add contact information or link to issues page]
This tool uses data from:
- NCBI PubMed
- NCBI Gene database
- MeSH (Medical Subject Headings)