Making supervised machine learning accessible to non-data scientists interested in microbiome research.
Four open source microbiome datasets were obtained and processed in Qiita [1]. The full bioinformatic pipeline was conducted in Qiita with QIIME2 [2] and can be found here, and more information on the datasets as a whole can be found in the Functional Specifications, Component Specifications, and our Homepage!
↓↓↓↓↓↓↓↓↓↓↓↓↓↓
| Project Homepage |
|---|
This package assumes you have Anaconda/Miniconda and git installed prior to starting!
$ conda create -n biome_env python=3
In the above command, I created a new environment named biome_env and specified that I want to use Python version 3 (in case you have both Python 2 and Python 3 installed).
In the terminal we will first activate our new environment biome_env, change our directory, and then clone the repo! For this tutorial, I will be cloning BioME to my Desktop. If you aren't sure where you are on your computer, don't worry! Type pwd (this means print working directory) into the terminal. This will tell you where you are! That might not be very helpful if you don't know what's in the directory, so type ls (list) in the terminal. This will tell you what is in your current directory. Once you have figured out where you are, you can use cd, which means change directory. In this case, I'm just moving "forward" into a folder in this directory (my Desktop!).
$ conda activate biome_env
$ cd Desktop
$ git clone https://github.com/kmherman/BioME.git
$ cd BioME
Note: If the above worked fine, you don't need this next bit of code and can skip down to Step 3. If the install_requires in the setup.py file doesn't install the needed packages, please use this command to install them prior to setting up BioME:
$ conda install numpy pytorch pandas scikit-learn
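If you aren't sure whether the packages made it into your environment, a quick check like the one below can help. This is a minimal sketch and not part of BioME: the helper name `missing_modules` is made up for illustration, and it assumes the usual import names (`torch` for pytorch, `sklearn` for scikit-learn).

```python
from importlib.util import find_spec

# Import names, not conda package names (assumption for this sketch):
# pytorch imports as "torch" and scikit-learn imports as "sklearn".
REQUIRED_MODULES = ("numpy", "torch", "pandas", "sklearn")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with biome_env activated. Anything it lists as missing still needs to be installed.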
Now take a look at your desktop. You'll see the BioME folder!
Now we just need to run setup.py and you are good to go!
Note: once this installs, you should be able to run the command from anywhere on your computer as long as biome_env is activated!
$ python setup.py install
For this tutorial we will be using the data provided in BioME. You can get an overview of the directory structure below ↓↓↓↓↓↓↓
Note: Your dataset should have a minimum of ~50 samples for machine learning classifiers. Metadata columns that are used as the classifying variable should have a minimum of 10 samples per unique value. Smaller counts will result in inaccurate models.
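Want to check your own metadata against these minimums before you start? Here is a rough sketch using only the Python standard library. It is not BioME's own code: the function names are hypothetical, and it assumes your metadata file is tab-separated with a header row.

```python
import csv
from collections import Counter


def check_sample_counts(rows, column, min_total=50, min_per_class=10):
    """Return (total, counts, too_small) for one metadata column.

    rows: iterable of dicts, one per sample (e.g. from csv.DictReader).
    too_small: sorted class labels with fewer than min_per_class samples.
    """
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    too_small = sorted(label for label, n in counts.items() if n < min_per_class)
    if total < min_total:
        print(f"Warning: only {total} samples (suggested minimum is about {min_total}).")
    for label in too_small:
        print(f"Warning: class {label!r} has only {counts[label]} samples.")
    return total, counts, too_small


def load_metadata(path):
    """Read a tab-separated metadata file into a list of dicts (assumed layout)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `check_sample_counts(load_metadata("biome/Data/FecesMeta.txt"), "ML_diagnosis")` would report any diagnosis with fewer than 10 samples, provided the file really follows the assumed layout.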
We are going to assume that you are still in the BioME directory from above, but if not you'll just have to change the file path to the dataset!
$ biome_run.py
Beautiful
(Write the path to the "bug" (OTU) table) If you look in the data folder, you'll see four files:
- "bug_OTU_rel.tsv" an OTU table with OTU counts in relative abundance
- "bug_OTU_raw.tsv" an OTU table with OTU counts in raw abundance
- "query_point.tsv" a single sample's OTU abundance (just for demonstration)
- "FecesMeta.txt" a metadata file with our categorical data. Take a look at the Functional Specifications doc to learn more about the data!
Write the path to the OTU table you choose to use. I'll be using the relative abundance table.
Note: All prompts below are CASE-SENSITIVE!!! In the prompt write:
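Curious how the two tables relate? Relative abundance is each OTU's raw count divided by the total count for that sample. The snippet below is only an illustration with made-up numbers and a hypothetical helper name. It assumes fractions that sum to 1, so it may not match exactly how the provided files were scaled.

```python
def to_relative_abundance(raw_counts):
    """Convert one sample's raw OTU counts to fractions that sum to 1."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("sample has no reads; relative abundance is undefined")
    return {otu: count / total for otu, count in raw_counts.items()}


sample = {"OTU_1": 30, "OTU_2": 50, "OTU_3": 20}  # made-up counts
print(to_relative_abundance(sample))
# {'OTU_1': 0.3, 'OTU_2': 0.5, 'OTU_3': 0.2}
```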
: biome/Data/bug_OTU_rel.tsv
Tip: Instead of writing it all out, you can drag the folder into the terminal and finish by typing the exact file name (at least on a Mac).
(Write the path to the meta data file)
: biome/Data/FecesMeta.txt
(Provide the categories that will be used for classification) In this instance, we will be using the "ML_diagnosis" column in the metadata file. It separates healthy humans (HC) from individuals with either Crohn's Disease (CD), Ulcerative Colitis (UC), Collagenous Colitis (CC), or Ileal Crohn's disease (IC). Full collection, DNA extraction, and 16S rRNA amplicon sequencing methodologies for each study can be found in the provided papers [3-5].
Lists need to be comma-separated with no spaces
: HC,CD,UC,CC,IC
Some of these models can take a long time to run, which doesn't mean they aren't excellent options! But for the tutorial we will only run a few. Remember, no spaces!
- Machine learning algorithm abbreviations:
- mlp1: Multilayer perceptron with 1 hidden layer
- mlp3: Multilayer perceptron with 3 hidden layers
- lr: logistic regression
- rr: ridge classifier (L2 regularizer)
- dtree: decision tree
- svc: support vector classifier
- knn: k-nearest neighbors algorithm (implemented with PCA)
- forest: random forest
- gnb: Gaussian Naive-Bayes
- all: train and evaluate every machine learning algorithm available (all above)
How fast does each algorithm run?
| Algorithm | Rank* |
|---|---|
| mlp1 | 7 |
| mlp3 | 8 |
| lr | 3 |
| rr | 1 |
| dtree | 9 |
| svc | 6 |
| knn | 2 |
| forest | 5 |
| gnb | 4 |
*For more information on these ML algorithms, see the Project Homepage
: dtree,mlp1,mlp3
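To make the "comma-separated, no spaces, case-sensitive" rules concrete, here is a small sketch of how an answer like the one above could be turned into a list of models. It is not BioME's actual parsing code. The function name is hypothetical and only the abbreviations come from the list above.

```python
AVAILABLE_MODELS = ("mlp1", "mlp3", "lr", "rr", "dtree", "svc", "knn", "forest", "gnb")


def parse_model_selection(text):
    """Turn a comma-separated answer such as 'dtree,mlp1,mlp3' into a list.

    'all' expands to every abbreviation. Names are case-sensitive and
    spaces are rejected, mirroring the prompt rules in this tutorial.
    """
    if " " in text:
        raise ValueError("use commas only, with no spaces")
    names = text.split(",")
    if "all" in names:
        return list(AVAILABLE_MODELS)
    unknown = [name for name in names if name not in AVAILABLE_MODELS]
    if unknown:
        raise ValueError(f"unknown model abbreviation(s): {unknown}")
    return names
```

With this sketch, `parse_model_selection("dtree,mlp1,mlp3")` gives `["dtree", "mlp1", "mlp3"]`, while `"dtree, mlp1"` or `"Dtree"` raises an error instead of silently training the wrong thing.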
Looks like the best model was mlp1: Multilayer perceptron with 1 hidden layer. Now you just need to decide if you want to use it to predict. If you do, type "yes" or "Yes":
? yes
Here we will be using the "query_point.tsv" file!
: biome/Data/query_point.tsv
Looks like CD, which, looking at our FecesMeta file, is right!
Don't want the tutorial to end? Check out the Demo at the Project Homepage!
Enjoy!
[1] Antonio Gonzalez, Jose A. Navas-Molina, Tomasz Kosciolek, Daniel McDonald, Yoshiki Vázquez-Baeza, Gail Ackermann, Jeff DeReus, Stefan Janssen, Austin D. Swafford, Stephanie B. Orchanian, Jon G. Sanders, Joshua Shorenstein, Hannes Holste, Semar Petrus, Adam Robbins-Pianka, Colin J. Brislawn, Mingxun Wang, Jai Ram Rideout, Evan Bolyen, Matthew Dillon, J. Gregory Caporaso, Pieter C. Dorrestein & Rob Knight. Qiita: rapid, web-enabled microbiome meta-analysis. Nature Methods, volume 15, pages 796–798 (2018); https://doi.org/10.1038/s41592-018-0141-9 Qiita website
[2] Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. 2019. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9 QIIME2 website
[3] Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, Schwager E, Knights D, Song SJ, Yassour M, Morgan XC, Kostic AD, Luo C, González A, McDonald D, Haberman Y, Walters T, Baker S, Rosh J, Stephens M, Heyman M, Markowitz J, Baldassano R, Griffiths A, Sylvester F, Mack D, Kim S, Crandall W, Hyams J, Huttenhower C, Knight R, Xavier RJ. The treatment-naive microbiome in new-onset Crohn's disease. Cell Host Microbe. 2014 Mar 12;15(3):382-392. doi: 10.1016/j.chom.2014.02.005. PMID: 24629344; PMCID: PMC4059512. https://pubmed.ncbi.nlm.nih.gov/24629344/
[4] Daniel McDonald, Embriette Hyde, Justine W. Debelius, James T. Morton, Antonio Gonzalez, Gail Ackermann, Alexander A. Aksenov, Bahar Behsaz, Caitriona Brennan, Yingfeng Chen, Lindsay DeRight Goldasich, Pieter C. Dorrestein, Robert R. Dunn, Ashkaan K. Fahimipour, James Gaffney, Jack A. Gilbert, Grant Gogul, Jessica L. Green, Philip Hugenholtz, Greg Humphrey, Curtis Huttenhower, Matthew A. Jackson, Stefan Janssen, Dilip V. Jeste, Lingjing Jiang, Scott T. Kelley, Dan Knights, Tomasz Kosciolek, Joshua Ladau, Jeff Leach, Clarisse Marotz, Dmitry Meleshko, Alexey V. Melnik, Jessica L. Metcalf, Hosein Mohimani, Emmanuel Montassier, Jose Navas-Molina, Tanya T. Nguyen, Shyamal Peddada, Pavel Pevzner, Katherine S. Pollard, Gholamali Rahnavard, Adam Robbins-Pianka, Naseer Sangwan, Joshua Shorenstein, Larry Smarr, Se Jin Song, Timothy Spector, Austin D. Swafford, Varykina G. Thackray, Luke R. Thompson, Anupriya Tripathi, Yoshiki Vázquez-Baeza, Alison Vrbanac, Paul Wischmeyer, Elaine Wolfe, Qiyun Zhu, The American Gut Consortium, Rob Knight. American Gut: an Open Platform for Citizen Science Microbiome Research. mSystems May 2018, 3 (3) e00031-18; DOI: 10.1128/mSystems.00031-18 https://msystems.asm.org/content/3/3/e00031-18
[5] Halfvarson, J., Brislawn, C., Lamendella, R. et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiol 2, 17004 (2017). https://doi.org/10.1038/nmicrobiol.2017.4


