Gene Interaction Network Builder

This document describes the new gene interaction network builder that complements the existing MeSH-based co-citation network analysis.

Overview

The build_gene_interaction_network.py script builds a gene co-citation network starting from a single gene (e.g., IL17A, TNF) instead of a MeSH term. It uses the same underlying logic and parameters as 4_build_network.py.

How It Works

  1. Start with a gene: Provide a gene symbol (e.g., IL17A)
  2. Find associated papers: Retrieves all PubMed IDs associated with that gene
  3. Identify co-cited genes: Finds all other genes mentioned in those papers
  4. Build network: Creates a co-occurrence network based on shared PMIDs
  5. Apply filters: Uses the same filtering parameters as the MeSH-based network

Usage

python code/build_gene_interaction_network.py --gene IL17A --year-start 2014 --year-end 2024 --organism map_to_human

Required Parameters

  • --gene: Starting gene symbol (e.g., 'IL17A', 'TNF', 'TP53')

Optional Parameters

  • --year-start: First publication year to include (if omitted, no year filter is applied)
  • --year-end: Last publication year to include (if omitted, no year filter is applied)
  • --organism: Organism filtering mode (default: map_to_human)
    • human_only: Only include human genes
    • map_to_human: Include non-human genes mapped to human orthologs
  • --min-papers-per-gene: Minimum papers per gene to include in network (default: 5)
  • --min-papers-per-edge: Minimum shared papers required for an edge (default: 3)
  • --exclude-seed: Exclude the starting gene from the network (default: False, gene is included)
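The difference between the two --organism modes can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper `collect_pmids`, the `(pmid, gene_id, tax_id)` tuple layout, the ortholog mapping, the non-human gene ID 900001, and the PMIDs are all hypothetical. Only the mode names and the `human_pmids` / `ortholog_pmids` / `total_pmids` field names come from this document.

```python
from collections import defaultdict

HUMAN_TAX_ID = 9606  # NCBI taxonomy ID for Homo sapiens


def collect_pmids(mentions, ortholog_to_human, mode="map_to_human"):
    """Group PMIDs by human gene ID (hypothetical helper).

    mentions: iterable of (pmid, gene_id, tax_id) tuples.
    ortholog_to_human: dict mapping non-human gene IDs to human gene IDs.
    """
    if mode not in ("human_only", "map_to_human"):
        raise ValueError(f"unknown organism mode: {mode}")

    human = defaultdict(set)
    ortholog = defaultdict(set)
    for pmid, gene_id, tax_id in mentions:
        if tax_id == HUMAN_TAX_ID:
            human[gene_id].add(pmid)
        elif mode == "map_to_human" and gene_id in ortholog_to_human:
            # Credit the paper to the human ortholog of the non-human gene.
            ortholog[ortholog_to_human[gene_id]].add(pmid)

    result = {}
    for gene_id in set(human) | set(ortholog):
        result[gene_id] = {
            "human_pmids": human[gene_id],
            "ortholog_pmids": ortholog[gene_id],
            "total_pmids": human[gene_id] | ortholog[gene_id],
        }
    return result


# Made-up data: two human IL17A papers and one paper on a mouse ortholog.
mentions = [(101, 3605, 9606), (102, 3605, 9606), (103, 900001, 10090)]
ortholog_to_human = {900001: 3605}

strict = collect_pmids(mentions, ortholog_to_human, mode="human_only")
mapped = collect_pmids(mentions, ortholog_to_human, mode="map_to_human")
print(len(strict[3605]["total_pmids"]), len(mapped[3605]["total_pmids"]))
```

In this toy data, map_to_human adds one paper to IL17A that human_only ignores, which is why it tends to give more coverage.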

Examples

Example 1: IL17A Network (2014-2024)

python code/build_gene_interaction_network.py \
  --gene IL17A \
  --year-start 2014 \
  --year-end 2024 \
  --organism map_to_human

Results:

  • 929 seed papers associated with IL17A
  • 29 genes in the network (including IL17A, after filtering)
  • 38 edges (gene pairs with ≥3 shared papers)

Top genes:

  • IL17A (929 papers - the starting gene)
  • IL17F (100 papers)
  • IL23A (40 papers)
  • IFNG (31 papers)
  • IL10, IL6, TNF (25 papers each)

Direct interactions with IL17A:

  • IL23A (40 shared papers)
  • IFNG (31 shared papers)
  • IL10, IL6, TNF (25 shared papers each)
  • IL17RA (22 shared papers)
  • TGFB1 (18 shared papers)

Example 2: TNF Network (2020-2024, human only)

python code/build_gene_interaction_network.py \
  --gene TNF \
  --year-start 2020 \
  --year-end 2024 \
  --organism human_only \
  --min-papers-per-gene 10 \
  --min-papers-per-edge 5

Results:

  • 415 seed papers associated with TNF
  • 7 genes in the network (after filtering)
  • 3 edges (gene pairs with ≥5 shared papers)

Top interacting genes:

  • IL6 (55 papers)
  • IL1B (32 papers)
  • IL10 (24 papers)

Output Files

The script creates a directory in /results with the following structure:

/results/gene_<GENE>_<YEAR_START>_<YEAR_END>_<ORGANISM>/
├── network_data.json    # Complete network data with metadata
├── nodes.csv           # List of genes with paper counts
└── edges.csv           # Network edges with shared paper counts

network_data.json

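Contains the complete network. The sample below is abbreviated; its num_genes and num_edges values (28 and 10) are consistent with an IL17A run using --exclude-seed rather than with the default run shown in Example 1 (29 genes, 38 edges):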

{
  "metadata": {
    "starting_gene": "IL17A",
    "starting_gene_id": 3605,
    "num_seed_papers": 929,
    "year_start": 2014,
    "year_end": 2024,
    "organism_mode": "map_to_human",
    "num_genes": 28,
    "num_edges": 10
  },
  "nodes": [
    {
      "gene_id": 112744,
      "symbol": "IL17F",
      "name": "interleukin 17F",
      "total_pmids": [...],
      "human_pmids": [...],
      "ortholog_pmids": [...],
      "total_count": 100,
      "human_count": 100,
      "ortholog_count": 0
    },
    ...
  ],
  "edges": [
    {
      "gene1": 23765,
      "gene2": 112744,
      "shared_pmids": [...],
      "weight": 10
    },
    ...
  ]
}
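To show how this file might be consumed, here is a small sketch that lists the partners of one gene, strongest first. It relies only on the fields shown above (`nodes[].gene_id`, `nodes[].symbol`, `edges[].gene1`, `edges[].gene2`, `edges[].weight`). The helper names `load_network` and `partners_of` are hypothetical, and treating `weight` as the number of shared papers is an assumption.

```python
import json


def load_network(path):
    """Read a network_data.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def partners_of(network, gene_id):
    """Return (symbol, weight) pairs for every edge touching gene_id."""
    symbols = {node["gene_id"]: node["symbol"] for node in network["nodes"]}
    partners = []
    for edge in network["edges"]:
        if edge["gene1"] == gene_id:
            other = edge["gene2"]
        elif edge["gene2"] == gene_id:
            other = edge["gene1"]
        else:
            continue
        # Fall back to the numeric ID when the node list has no symbol.
        partners.append((symbols.get(other, str(other)), edge["weight"]))
    return sorted(partners, key=lambda pair: pair[1], reverse=True)


# Inline stand-in for a loaded file, using only the values from the sample above.
sample = {
    "nodes": [{"gene_id": 112744, "symbol": "IL17F"}],
    "edges": [{"gene1": 23765, "gene2": 112744, "weight": 10}],
}
print(partners_of(sample, 112744))
```

With a real file, `partners_of(load_network(path), network["metadata"]["starting_gene_id"])` would give the seed gene's partners, provided the seed was not excluded.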

Visualization

The gene-based networks work with the existing visualization tools:

Generate DOT file for Graphviz

python code/5_export_dot.py /results/gene_IL17A_2014_2024_map_to_human/network_data.json

This creates network.dot which can be visualized with Graphviz:

dot -Tpng network.dot -o network.png
# or
dot -Tsvg network.dot -o network.svg

Generate HTML page

python code/6_generate_html.py /results/gene_IL17A_2014_2024_map_to_human/network_data.json

This generates a single-file HTML (index.html) with:

  • Network overview with statistics
  • Genes table with clickable paper counts
  • Edges table with clickable shared paper counts
  • Direct PubMed links: Click any paper count to open all those papers in PubMed
    • Green numbers = all papers (total)
    • Blue numbers = human gene papers only
    • Orange numbers = ortholog gene papers only

Note: The old multi-file HTML generator (with separate pages per gene/edge) is available as 6_generate_html_multifile.py if needed.

Differences from MeSH-based Networks

| Feature | MeSH-based (4_build_network.py) | Gene-based (build_gene_interaction_network.py) |
| --- | --- | --- |
| Starting point | MeSH term (e.g., "Chordoma") | Gene symbol (e.g., "IL17A") |
| Seed papers | Papers tagged with MeSH term | Papers associated with gene |
| Network type | Co-citation within topic | Interaction partners of gene |
| Use case | Topic-focused research | Gene-centered analysis |
| Metadata field | mesh_term | starting_gene, starting_gene_id, num_seed_papers |

Implementation Details

Gene Lookup

The script automatically:

  1. Looks up the gene ID from the symbol (case-insensitive)
  2. Validates that the gene exists in the database
  3. Suggests similar gene symbols if not found
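A case-insensitive lookup with suggestions could look like the sketch below. It is an illustration under assumptions: the script's real matching method is not described here, so Python's standard `difflib.get_close_matches` stands in for it, and `lookup_gene` plus the in-memory symbol table are hypothetical (the real script queries the database).

```python
import difflib


def lookup_gene(symbol, symbol_to_id):
    """Return (gene_id, suggestions) for a gene symbol.

    gene_id is None when the symbol is unknown; suggestions then holds
    up to three similar symbols.
    """
    by_upper = {known.upper(): gene_id for known, gene_id in symbol_to_id.items()}
    query = symbol.strip().upper()
    if query in by_upper:
        return by_upper[query], []
    suggestions = difflib.get_close_matches(query, list(by_upper), n=3, cutoff=0.6)
    return None, suggestions


# Gene IDs taken from the network_data.json sample in this document.
known_genes = {"IL17A": 3605, "IL17F": 112744}

found_id, _ = lookup_gene("il17a", known_genes)
missing_id, suggestions = lookup_gene("IL17", known_genes)
print(found_id, missing_id, sorted(suggestions))
```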

Paper Retrieval

Papers are retrieved in two steps:

  1. Get all PMIDs for the starting gene from gene_pubmed table
  2. If year filters are specified, filter PMIDs using the pubmed_articles table
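The two retrieval steps can be sketched with an in-memory SQLite database. The table names gene_pubmed and pubmed_articles come from this document; the column names (`gene_id`, `pmid`, `year`), the use of SQLite, the helper `seed_pmids`, and the PMIDs are assumptions made for illustration.

```python
import sqlite3


def seed_pmids(conn, gene_id, year_start=None, year_end=None):
    """Step 1: PMIDs for the gene. Step 2: optional year filter."""
    pmids = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT pmid FROM gene_pubmed WHERE gene_id = ?", (gene_id,)
        )
    }
    if year_start is None and year_end is None:
        return pmids  # no year filter requested

    clauses, params = [], []
    if year_start is not None:
        clauses.append("year >= ?")
        params.append(year_start)
    if year_end is not None:
        clauses.append("year <= ?")
        params.append(year_end)
    query = "SELECT pmid FROM pubmed_articles WHERE " + " AND ".join(clauses)
    in_range = {row[0] for row in conn.execute(query, params)}
    return pmids & in_range


# Toy database with made-up PMIDs and years.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_pubmed (gene_id INTEGER, pmid INTEGER);
    CREATE TABLE pubmed_articles (pmid INTEGER PRIMARY KEY, year INTEGER);
    """
)
conn.executemany(
    "INSERT INTO gene_pubmed VALUES (?, ?)",
    [(3605, 1), (3605, 2), (3605, 3), (112744, 2)],
)
conn.executemany(
    "INSERT INTO pubmed_articles VALUES (?, ?)",
    [(1, 2010), (2, 2015), (3, 2022)],
)

all_years = seed_pmids(conn, 3605)
recent = seed_pmids(conn, 3605, year_start=2014, year_end=2024)
print(sorted(all_years), sorted(recent))
```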

Network Construction

The network construction follows the same logic as MeSH-based networks:

  1. Find all genes co-cited in seed papers
  2. Support both human_only and map_to_human modes
  3. Filter genes by minimum paper count
  4. Build edges based on shared PMIDs
  5. Filter edges by minimum shared paper count
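The filtering and edge-building steps above can be sketched in a few lines. This is not the code from 4_build_network.py or build_gene_interaction_network.py; `build_network`, its argument layout, and the toy PMIDs are hypothetical. The two thresholds correspond to --min-papers-per-gene and --min-papers-per-edge.

```python
from itertools import combinations


def build_network(gene_pmids, seed_pmids, min_papers_per_gene=5, min_papers_per_edge=3):
    """Build a co-occurrence network restricted to the seed papers.

    gene_pmids: dict mapping gene symbol -> set of PMIDs mentioning it.
    seed_pmids: set of PMIDs associated with the starting gene.
    Returns (nodes, edges): nodes maps gene -> PMIDs within the seed set,
    edges maps (gene1, gene2) -> shared PMIDs.
    """
    # Only papers in the seed set count toward the network.
    in_seed = {gene: pmids & seed_pmids for gene, pmids in gene_pmids.items()}

    # Keep genes with enough papers.
    nodes = {g: p for g, p in in_seed.items() if len(p) >= min_papers_per_gene}

    # Keep gene pairs with enough shared papers.
    edges = {}
    for gene1, gene2 in combinations(sorted(nodes), 2):
        shared = nodes[gene1] & nodes[gene2]
        if len(shared) >= min_papers_per_edge:
            edges[(gene1, gene2)] = shared
    return nodes, edges


# Toy data with made-up PMIDs; thresholds lowered so the example stays small.
gene_pmids = {
    "IL17A": {1, 2, 3, 4},
    "IL17F": {1, 2, 3},
    "IL23A": {3, 4, 9},  # PMID 9 is outside the seed set and is ignored
    "IFNG": {4},         # dropped: only one seed paper
}
nodes, edges = build_network(
    gene_pmids, seed_pmids={1, 2, 3, 4}, min_papers_per_gene=2, min_papers_per_edge=2
)
print(sorted(nodes), sorted(edges))
```

Using sets for PMIDs keeps the shared-paper count a plain set intersection, so an edge weight is just `len(shared)`.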

Including/Excluding the Starting Gene

By default, the starting gene is included in the network. This matters because:

  1. The starting gene acts as a hub connecting other genes
  2. Without it, many edges are lost, leaving only the connections among the other genes
  3. It provides context for understanding the network structure
  4. You can see how strongly other genes are co-cited with your gene of interest

Example with IL17A:

  • With IL17A included (default): 38 edges
  • With IL17A excluded (--exclude-seed): 10 edges

Use --exclude-seed only if you specifically want to see how the other genes relate to each other, independent of the starting gene.
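The drop in edge count follows from the seed being a hub: every seed paper mentions the starting gene, so it shares papers with every other gene in the network. A toy sketch of the effect, using a hypothetical `exclude_seed` helper and made-up PMIDs (whether the script removes the seed before or after filtering is not stated here, so this removes it afterwards):

```python
def exclude_seed(nodes, edges, seed):
    """Drop the seed gene and every edge that touches it."""
    kept_nodes = {gene: pmids for gene, pmids in nodes.items() if gene != seed}
    kept_edges = {pair: shared for pair, shared in edges.items() if seed not in pair}
    return kept_nodes, kept_edges


# Toy network: two of the three edges run through the seed gene.
toy_nodes = {"IL17A": {1, 2, 3, 4, 5}, "IL17F": {1, 2, 3, 5}, "IL23A": {3, 4, 5}}
toy_edges = {
    ("IL17A", "IL17F"): {1, 2, 3, 5},
    ("IL17A", "IL23A"): {3, 4, 5},
    ("IL17F", "IL23A"): {3, 5},
}

rest_nodes, rest_edges = exclude_seed(toy_nodes, toy_edges, seed="IL17A")
print(len(toy_edges), len(rest_edges))
```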

Tips

  1. Start broad, then narrow: Begin with default parameters, then increase thresholds to reduce network size
  2. Use year filters: Recent papers may show emerging interactions
  3. Compare organism modes: map_to_human provides more coverage, human_only is more specific
  4. Adjust thresholds: For highly studied genes, increase --min-papers-per-gene to focus on strongest interactions

Integration with Existing Pipeline

The gene-based network builder is compatible with the existing tools:

  • ✓ Uses the same database schema
  • ✓ Produces compatible JSON format
  • ✓ Works with 5_export_dot.py for DOT export
  • ✓ Works with 6_generate_html.py for HTML generation
  • ✓ Follows the same filtering and ortholog mapping logic