Predicting drug–gene relations via analogy tasks with word embeddings
Hiroaki Yamagiwa, Ryoma Hashimoto, Kiwamu Arakane, Ken Murakami, Shou Soeda, Momose Oyama, Yihua Zhu, Mariko Okada, Hidetoshi Shimodaira
Scientific Reports 15, 17240 (2025) [arXiv]
The code is meant to run inside Docker. If you prefer other setups, install the packages listed in requirements.txt.
$ bash scripts/docker/build.sh
$ bash scripts/docker/run.sh
Download the BioConceptVec skip-gram embeddings:
$ mkdir -p data/embeddings
$ wget -c -O data/embeddings/concept_skipgram.json https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_skip.json
Skip-gram embeddings trained on PubMed abstracts in 5-year windows from 1970 are available on Google Drive.
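After the download, a quick check that the file parses and has the expected shape can save time later. The following is a minimal sketch, assuming the JSON file is a single object that maps concept IDs to lists of floats; the helper name `load_embeddings` is hypothetical and is not part of this repository.

```python
import json

import numpy as np


def load_embeddings(path):
    """Load a JSON object of {concept_id: [float, ...]} into NumPy vectors.

    Assumption: the file is one JSON object keyed by concept ID.
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # Convert each list of floats to a float32 array to reduce memory use.
    return {cid: np.asarray(vec, dtype=np.float32) for cid, vec in raw.items()}
```

Calling `load_embeddings("data/embeddings/concept_skipgram.json")` and printing `len(emb)` together with the length of any one vector gives the vocabulary size and embedding dimension. The full file may be large, so loading it in one pass can require a substantial amount of memory.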
See README.prepare.md for the full preprocessing pipeline.
Pre-generated data are already placed in the data/ directory.
Fig. 2a: PCA of drugs and genes for a randomly selected relation
$ python Fig2a.py
Fig. 2b: PCA of drugs and genes classified to the ErbB signaling pathway
$ python Fig2b.py

$ python eval_analogy.py
$ python eval_analogy_Y1.py
$ python eval_analogy_Y2.py
$ python eval_analogy_P1Y1_and_P2Y1.py
$ python eval_analogy_P1Y2_and_P2Y2.py
Results produced with the OpenAI API are stored under output/analogy_API/ and are loaded by default:
$ python eval_OpenAI_API.py
If you wish to rerun the API experiments, adjust the scripts as needed.
$ python Fig3_FigS4.py
$ python Table4.py
Generate analogy-based predictions using 10%, 20%, …, 60% of the training data:
$ python eval_analogy_for_comparing_with_TransE.py
Compare them with TransE results:
$ python Fig4.py
(TransE scores are currently hard-coded inside Fig4.py; a dedicated script will be released later.)
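For orientation, the analogy-based prediction that these scripts evaluate can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: it assumes the relation vector is the mean gene embedding minus the mean drug embedding over known drug–gene pairs, and that candidates are ranked by cosine similarity. The function names `relation_vector` and `predict_genes` are hypothetical.

```python
import numpy as np


def unit(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def relation_vector(emb, pairs):
    """Mean gene embedding minus mean drug embedding (assumed definition).

    `pairs` is an iterable of (drug_id, gene_id) tuples; each distinct drug
    and gene contributes once to its mean.
    """
    drugs = sorted({d for d, _ in pairs})
    genes = sorted({g for _, g in pairs})
    mean_drug = np.mean([emb[d] for d in drugs], axis=0)
    mean_gene = np.mean([emb[g] for g in genes], axis=0)
    return mean_gene - mean_drug


def predict_genes(emb, drug, rel, gene_ids, k=10):
    """Rank candidate genes by cosine similarity to emb[drug] + rel."""
    query = unit(emb[drug] + rel)
    candidates = unit(np.stack([emb[g] for g in gene_ids]))
    scores = candidates @ query
    order = np.argsort(-scores)[:k]
    return [(gene_ids[i], float(scores[i])) for i in order]
```

In this sketch, using a fraction of the training data corresponds to estimating `rel` from a subsample of the known pairs and evaluating the ranking on the held-out drugs.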
- Chen et al. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol. (2020).
Distribution of answer-set sizes:
$ python FigS2_S3.py
Search-result rank by answer-set size:
$ python FigS5.py
Weighted correlations are computed with the WeightedCorr repository.
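As background on what a weighted correlation measures, the standard weighted Pearson coefficient can be written in a few lines of NumPy. This is a sketch of the textbook formula only; it does not use or reproduce the WeightedCorr API, and this README does not state whether the scripts use a Pearson or a rank-based variant. The function name `weighted_pearson` is hypothetical.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    # Weighted covariance and variances around the weighted means.
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)
```

With equal weights this reduces to the ordinary Pearson correlation; larger weights make the corresponding observations count more toward the estimate.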
@article{Yamagiwa2025,
title = {Predicting drug–gene relations via analogy tasks with word embeddings},
author = {Yamagiwa, Hiroaki and Hashimoto, Ryoma and Arakane, Kiwamu and Murakami, Ken and Soeda, Shou and Oyama, Momose and Zhu, Yihua and Okada, Mariko and Shimodaira, Hidetoshi},
journal = {Scientific Reports},
volume = {15},
number = {1},
pages = {17240},
year = {2025},
month = {May},
doi = {10.1038/s41598-025-01418-z},
url = {https://doi.org/10.1038/s41598-025-01418-z},
issn = {2045-2322}
}
- Embedding URLs may change; please refer to the GitHub repository rather than the raw download link.
- This directory was created by Hiroaki Yamagiwa.
- Embeddings were trained by Ryoma Hashimoto.
- KEGG ID to MeSH ID conversion (prepare_kegg2mesh.ipynb) was implemented by Kiwamu Arakane.
- TransE prediction experiments were conducted by Yihua Zhu.
