This repository contains the code and data for the paper "High throughput genomic feature extraction reveals prokaryotic adaptations to the abiotic environment", by Maria Beatriz Walter Costa, Rose Brouns, Aristeidis Litos, Heyde França, Maria Schreiber, Francesco Bisiach, Bas E. Dutilh.
Scripts are located in the scripts/ folder.
The files in the data/ folder are described below. You can uncompress .tar.gz files in the terminal with: tar -xzvf FILENAME.tar.gz
- df_bacdive.tar.gz: contains 91,228 rows of prokaryotic isolates from BacDive and 13 columns with the taxonomy, genome assembly ID, and metadata on abiotic growth factors. The metadata contains the minimum and maximum reported values separated by a minus sign. The filters described in the paper (see Material and Methods) were applied.
- All files in the sub-folders oxygen, pH, salt and temperature are in pickle.zst format and are intended for developing machine learning classification and regression models for the following feature types: amino acid frequencies, eggNOG COGs, k-mer profiles (k = 9) and ncRNA families. If you want to predict the abiotic growth factors of a new genome or MAG, you can use these files as a training set. Note that all classification models (including oxygen) were built on two contrasting classes in order to investigate general underlying biological mechanisms. If you wish to predict phenotypes of new isolates, use the regression files. For oxygen, you should develop a new classification model based on the data in df_bacdive.tar.gz, which contains aerobic and anaerobic as well as intermediate classes.
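
The growth-factor metadata in df_bacdive.tar.gz stores ranges as a minimum and a maximum separated by a minus sign. The Python sketch below shows one way to split such a string into two numbers. It is not taken from the scripts/ folder: the function name parse_range is hypothetical, and it assumes that values may be decimal or negative (for example a sub-zero temperature) and that a lone value stands for both minimum and maximum. Check these assumptions against the actual table before relying on it.

```python
import re
from typing import Optional, Tuple

# Assumed format: "min-max", where either number may be negative or decimal,
# e.g. "20-37", "6.5-8" or "-5-10". A single value such as "30" is also accepted.
_RANGE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(?:-\s*(-?\d+(?:\.\d+)?))?\s*$")


def parse_range(value: object) -> Optional[Tuple[float, float]]:
    """Return (minimum, maximum) parsed from a 'min-max' string.

    Returns None for missing or unparseable entries (hypothetical helper).
    """
    if not isinstance(value, str):
        return None  # e.g. a missing value in the table
    match = _RANGE.match(value)
    if match is None:
        return None
    low = float(match.group(1))
    # Assumption: a single reported value is both the minimum and the maximum.
    high = float(match.group(2)) if match.group(2) is not None else low
    return (low, high)


if __name__ == "__main__":
    for text in ["20-37", "6.5-8", "-5-10", "30", "unknown"]:
        print(text, "->", parse_range(text))
```

The first number is matched before the separator is looked for, so a leading minus sign is read as part of a negative minimum and not as the separator.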