
Machine learning toolkit for predicting disease-associated genetic variants

Li Chen edited this page Apr 18, 2020 · 10 revisions

Background

Genetic variants play a crucial role in disease onset, so it is important to identify disease-associated variants accurately. However, no R package currently offers this functionality. In this GSOC project, students will tackle the problem by developing a machine learning toolkit built around two novel machine learning algorithms for predicting disease-associated genetic variants.

Details of your coding project

The purpose of this work is to provide R users with a comprehensive machine learning toolkit for predicting disease-associated genetic variants. It mainly consists of two core novel machine learning algorithms:

  • A weighted ensemble learning framework for predicting disease-associated genetic variants: The model will combine multiple scoring systems in a unified framework by developing a constrained penalized optimization algorithm.
  • A transfer learning framework based on a convolutional neural network (CNN): Transfer learning is useful when the sample size for a specific disease is small. In this case, the CNN will be trained on a large-scale set of experimentally validated genetic variants from mixed diseases and then fine-tuned using disease-specific genetic variants.
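To make the first framework more concrete, here is a minimal sketch of one way a constrained penalized ensemble could look. The project description does not specify the loss, the penalty, or the constraints, so everything here is an assumption: a squared-error loss, an L2 penalty, weights constrained to be non-negative and to sum to one, and projected gradient descent as the solver. The function names and toy data are hypothetical, and plain Python is used only to keep the example self-contained; the toolkit itself targets R.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) == 1}."""
    theta, cumulative = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        candidate = (cumulative - 1.0) / i
        if u - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [max(x - theta, 0.0) for x in v]


def fit_weights(scores, labels, penalty=0.01, step=0.1, n_iter=500):
    """Projected gradient descent on mean squared error plus an L2 penalty.

    scores: one row per variant, one column per scoring system (values in [0, 1]).
    labels: 1 for disease-associated variants, 0 for controls.
    """
    n, k = len(scores), len(scores[0])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        grad = [2.0 * penalty * wj for wj in w]  # gradient of the L2 penalty
        for row, y in zip(scores, labels):
            err = sum(wj * sj for wj, sj in zip(w, row)) - y
            for j in range(k):
                grad[j] += 2.0 * err * row[j] / n
        # Gradient step, then projection back onto the constraint set.
        w = project_to_simplex([wj - step * gj for wj, gj in zip(w, grad)])
    return w


# Toy data (made up): the first scoring system separates the classes,
# the second is close to uninformative.
scores = [[0.9, 0.5], [0.8, 0.4], [0.1, 0.6], [0.2, 0.5]]
labels = [1, 1, 0, 0]
weights = fit_weights(scores, labels)
print(weights)
```

On this toy data the informative scoring system should end up with the larger weight. A real implementation would likely use a different loss and a proper solver; the point is only to show how the constraint and the penalty enter the optimization.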

Details

A weighted ensemble learning framework

  • Build the probability density functions using precomputed scores from multiple scoring systems via kernel density estimation.
  • Implement the constrained penalized optimization algorithm.
  • Design simulation studies to test the model.
  • Test the model on real datasets [1, 2, 3].
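The kernel density estimation step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's method: it uses a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, a single scoring system, and an equal prior for the two classes. The score values are made up and the function names are hypothetical. Plain Python is used for self-containment; in R, the built-in `density()` function covers the same ground.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-1 / 5)


def gaussian_kde(samples, bandwidth=None):
    """Return a function that evaluates the Gaussian kernel density estimate."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(samples)
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


def posterior_risk(x, f_risk, f_benign, prior=0.5):
    """Bayes rule with the two estimated densities as class likelihoods."""
    num = prior * f_risk(x)
    den = num + (1.0 - prior) * f_benign(x)
    return num / den if den > 0 else prior  # fall back to the prior far from all data


# Made-up precomputed scores from one scoring system.
risk_scores = [0.71, 0.80, 0.85, 0.90, 0.93, 0.78]    # disease-associated variants
benign_scores = [0.10, 0.15, 0.22, 0.30, 0.18, 0.25]  # control variants

f_risk = gaussian_kde(risk_scores)
f_benign = gaussian_kde(benign_scores)
print(posterior_risk(0.85, f_risk, f_benign))
```

With several scoring systems, one density pair would be built per system, and the ensemble weights would then determine how the per-system evidence is combined.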

A transfer learning framework based on convolutional neural network (CNN)

  • Build a backbone convolutional neural network containing an embedding layer (with different embedding sizes), 1D, 2D, or dilated convolutional layers (with different window sizes and strides), max-pooling layers, and fully connected layers, using the TensorFlow for R API. Different optimization methods such as Adam, RMSprop, and SGD with momentum will each be used to train the network separately, and the results will be compared.
  • The backbone CNN will be trained on a large-scale, experimentally validated dataset [1] and then fine-tuned on different disease-specific datasets [2, 3].
  • Students will also design different strategies for fine-tuning the CNN (e.g., freezing some layers and re-training others).
  • Popular techniques such as (spatial) dropout, batch normalization, and regularization will also be added to the backbone CNN to avoid common neural-network problems such as overfitting and non-convergence.
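To illustrate what window size, stride, and dilation mean for the 1D convolutional and max-pooling layers mentioned above, here is a small sketch in plain Python. It is not TensorFlow code and not part of the proposed toolkit. It assumes no padding ("valid" convolution) and, as deep learning frameworks do, computes a cross-correlation without flipping the kernel. The function names are hypothetical.

```python
def conv1d(signal, kernel, stride=1, dilation=1):
    """1D convolution without padding; dilation spaces out the kernel taps."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered by one output
    return [
        sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(0, len(signal) - span + 1, stride)
    ]


def max_pool1d(signal, size, stride=None):
    """Max pooling over windows of `size`; stride defaults to the window size."""
    stride = stride or size
    return [
        max(signal[i:i + size])
        for i in range(0, len(signal) - size + 1, stride)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # window size 3

plain = conv1d(signal, kernel)                # 6 outputs
dilated = conv1d(signal, kernel, dilation=2)  # 4 outputs, each covering 5 inputs
strided = conv1d(signal, kernel, stride=2)    # 3 outputs
pooled = max_pool1d([1, 3, 2, 5, 4, 0], size=2)
print(len(plain), len(dilated), len(strided), pooled)
```

The output length follows floor((n - span) / stride) + 1, where span = (window - 1) * dilation + 1. Dilation widens the receptive field without adding weights, which is one reason to compare dilated layers against ordinary ones.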

Mentors

Li Chen [email protected] is a tenure-track Assistant Professor of Medicine and a member of the Center for Computational Biology and Bioinformatics at Indiana University School of Medicine (IUSM). He has previously served as a GSOC mentor and has published multiple R packages on CRAN.

Xiao Qin [email protected] is a full Professor in the Department of Computer Science and Software Engineering at Auburn University.

Tests for potential students

Easy

  • Can you explain what a convolutional neural network is?
  • Can you explain what transfer learning is?
  • Can you install TensorFlow on your machine and implement a simple CNN on MNIST using the TF Estimator API, following the official documentation?

Medium

  • What is overfitting?
  • What are L1 and L2 regularization?
  • What should we do if the loss doesn’t converge?
  • Can you implement a simple CNN without the Estimator API?

Reference

  1. Ritchie, G. R., Dunham, I., Zeggini, E., & Flicek, P. (2014). Functional annotation of noncoding sequence variants. Nature Methods, 11(3), 294.
  2. Wang, J., Dayem Ullah, A. Z., & Chelala, C. (2018). IW-Scoring: an Integrative Weighted Scoring framework for annotating and prioritizing genetic variations in the noncoding genome. Nucleic Acids Research, 46(8), e47.
  3. Chen, L., Jin, P., & Qin, Z. S. (2016). DIVAN: accurate identification of non-coding disease-specific risk variants based on multi-omics profiles. Genome Biology, 17, 252.

Solutions of tests

Students, please post a link to your test results here.

  1. Name - Varad Srivastava, Test_Solution
  2. Name - Ye Wang, Test Solution