Machine learning toolkit for predicting disease-associated genetic variants
Genetic variants play a crucial role in disease onset, so it is important to identify disease-associated variants accurately. However, no R package is currently available for this task. In this GSoC project, students will tackle the problem by developing a machine learning toolkit built around two novel machine learning algorithms for predicting disease-associated genetic variants.
The purpose of this work is to provide R users with a comprehensive machine learning toolkit for predicting disease-associated genetic variants. It mainly consists of two core novel machine learning algorithms:
- A weighted ensemble learning framework for predicting disease-associated genetic variants: The model will combine multiple scoring systems for predicting disease-associated genetic variants in a unified framework, using a constrained penalized optimization algorithm.
- A transfer learning framework based on a convolutional neural network (CNN): This approach is useful when the sample size for a specific disease is small. In this case, the CNN will be trained on a large-scale, experimentally validated set of genetic variants from mixed diseases and then fine-tuned using disease-specific genetic variants.
- Build the probability density functions using precomputed scores from multiple scoring systems via kernel density estimation.
- Implement the constrained penalized optimization algorithm.
- Design simulation studies to test the model.
- Test the model on real datasets [1, 2, 3].
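The first two steps above can be illustrated with a minimal sketch. It is written in plain Python purely for illustration (the project itself targets R) and is not the project's algorithm: the Gaussian kernel, the fixed bandwidth, the log density ratio as a per-system feature, the logistic loss with an L2 penalty, and the constraint that the weights are non-negative and sum to one are all assumptions made here. The function names and the toy scores are hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density estimate built from Gaussian kernels centred on `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def log_density_ratio(risk_scores, benign_scores, bandwidth=0.1, eps=1e-12):
    """Log ratio of the risk-variant density to the benign-variant density (assumed feature)."""
    f_risk = gaussian_kde(risk_scores, bandwidth)
    f_benign = gaussian_kde(benign_scores, bandwidth)
    return lambda x: math.log(f_risk(x) + eps) - math.log(f_benign(x) + eps)

def project_to_simplex(w):
    """Euclidean projection onto {w : w_i >= 0, sum(w) == 1} (assumed constraint set)."""
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(sorted(w, reverse=True), start=1):
        cumulative += u_i
        candidate = (cumulative - 1.0) / i
        if u_i - candidate > 0:
            theta = candidate
    return [max(w_i - theta, 0.0) for w_i in w]

def fit_weights(features, labels, penalty=0.1, step=0.1, iterations=200):
    """Projected gradient descent on an L2-penalized logistic loss (assumed objective)."""
    k, n = len(features[0]), len(features)
    w = [1.0 / k] * k
    for _ in range(iterations):
        grad = [2.0 * penalty * w_j for w_j in w]          # gradient of the penalty
        for x, y in zip(features, labels):
            z = sum(w_j * x_j for w_j, x_j in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))                  # predicted risk probability
            for j in range(k):
                grad[j] += (p - y) * x[j] / n               # gradient of the logistic loss
        w = project_to_simplex([w_j - step * g_j for w_j, g_j in zip(w, grad)])
    return w

# Made-up toy scores: system "A" separates the classes, system "B" does not.
risk = {"A": [0.7, 0.8, 0.9], "B": [0.2, 0.5, 0.8]}
benign = {"A": [0.1, 0.2, 0.3], "B": [0.2, 0.5, 0.8]}
ratios = {name: log_density_ratio(risk[name], benign[name]) for name in ("A", "B")}

features = [[ratios["A"](a), ratios["B"](b)] for a, b in zip(risk["A"], risk["B"])]
features += [[ratios["A"](a), ratios["B"](b)] for a, b in zip(benign["A"], benign["B"])]
labels = [1, 1, 1, 0, 0, 0]
weights = fit_weights(features, labels)
print(weights)
```

In this toy setting the uninformative system receives the smaller weight, because its log density ratio is zero for every variant and only the penalty acts on it. A real implementation would need held-out data, bandwidth selection, and whatever constraints and penalty the project actually specifies.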
- Build a backbone convolutional neural network using the TensorFlow for R API. It will contain an embedding layer with different embedding sizes, 1D / 2D or dilated convolutional layers with different window sizes and strides, max-pooling layers, and fully connected layers. The network will be trained separately with different optimization methods such as Adam, RMSprop, and SGD with momentum, and the results will be compared.
- The backbone CNN will be trained on a large-scale, experimentally validated dataset [1] and fine-tuned on different disease-specific datasets [2, 3].
- Students will also design different strategies for fine-tuning the CNN (e.g., freezing some layers and re-training others).
- Popular techniques such as (spatial) dropout, batch normalization, and regularization will also be added to the backbone CNN to avoid common problems in neural networks such as overfitting and non-convergence.
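To make the building blocks above concrete, here is a small plain-Python sketch of the forward pass through one dilated, strided 1D convolution followed by max pooling, plus a fine-tuning update that skips frozen layers. It is an illustration only and does not use the TensorFlow for R API: the one-hot encoding (in place of a learned embedding), the ReLU activation, the plain SGD update, and all function names and toy values are assumptions made here.

```python
def one_hot(sequence, alphabet="ACGT"):
    """Encode a DNA string as a list of one-hot vectors (stand-in for an embedding layer)."""
    return [[1.0 if base == letter else 0.0 for letter in alphabet] for base in sequence]

def conv1d(inputs, kernel, stride=1, dilation=1):
    """Apply one filter (window x channels) to a (length x channels) input, no padding."""
    window = len(kernel)
    span = dilation * (window - 1) + 1          # input positions covered by the filter
    outputs = []
    for start in range(0, len(inputs) - span + 1, stride):
        total = 0.0
        for k in range(window):
            position = inputs[start + k * dilation]
            total += sum(w * x for w, x in zip(kernel[k], position))
        outputs.append(max(total, 0.0))         # ReLU activation (assumed)
    return outputs

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def fine_tune_step(layers, gradients, frozen, learning_rate=0.01):
    """One plain SGD step that leaves every layer named in `frozen` unchanged."""
    updated = {}
    for name, weights in layers.items():
        if name in frozen:
            updated[name] = list(weights)       # frozen: copy weights as they are
        else:
            updated[name] = [w - learning_rate * g
                             for w, g in zip(weights, gradients[name])]
    return updated

# Toy forward pass: a filter that responds to "A", one skipped base, then "G".
encoded = one_hot("ACGTACGTAC")
motif_filter = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
feature_map = conv1d(encoded, motif_filter, stride=1, dilation=2)
pooled = max_pool1d(feature_map, size=2)
print(feature_map, pooled)

# Toy fine-tuning step: the convolutional layer is frozen, the dense layer is re-trained.
layers = {"conv": [1.0, 2.0], "dense": [1.0]}
gradients = {"conv": [1.0, 1.0], "dense": [1.0]}
tuned = fine_tune_step(layers, gradients, frozen={"conv"}, learning_rate=0.5)
print(tuned)
```

With no padding, the convolution produces `floor((length - span) / stride) + 1` outputs, where `span = dilation * (window - 1) + 1`. This is why larger windows, dilations, and strides all shorten the feature map.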
Li Chen is a tenure-track Assistant Professor of Medicine and a member of the Center for Computational Biology and Bioinformatics at Indiana University School of Medicine (IUSM). He has previously served as a GSoC mentor and has published multiple R packages on CRAN.
Xiao Qin is a full Professor in the Department of Computer Science and Software Engineering at Auburn University.
- Can you explain what a convolutional neural network is?
- Can you explain what transfer learning is?
- Can you install TensorFlow on your machine and implement a simple CNN on MNIST with the TF Estimator API, following the official documentation?
- What is overfitting?
- What are L1 and L2 regularization?
- What should we do if the loss doesn’t converge?
- Can you implement a simple CNN without the Estimator API?
- [1] Ritchie, G. R., Dunham, I., Zeggini, E., & Flicek, P. (2014). Functional annotation of noncoding sequence variants. Nature Methods, 11(3), 294.
- [2] Wang, J., Dayem Ullah, A. Z., & Chelala, C. (2018). IW-Scoring: an Integrative Weighted Scoring framework for annotating and prioritizing genetic variations in the noncoding genome. Nucleic Acids Research, 46(8), e47.
- [3] Chen, L., Jin, P., & Qin, Z. S. (2016). DIVAN: Accurate identification of non-coding disease-specific risk variants based on multi-omics profiles. Genome Biology, 17, 252.
Students, please post a link to your test results here.
- Name - Varad Srivastava, Test_Solution
- Name - Ye Wang, Test Solution