Lewis Moffat edited this page Aug 31, 2017 · 1 revision

Welcome to the tcRIP wiki!

This is a short page that describes what is happening in each file and the distinctions between some of the file names.

Underlying Code Blocks (UL)

These scripts contain a number of class definitions and methods that are leveraged by some of the ML scripts to perform classification.

  • UL_atchFactors.py
    • Contains the Atchley factors and the corresponding function that retrieves the five factors for a given AA via a dictionary lookup
  • UL_autoencoderModel.py
    • Contains the base class used by ML_AutoEncoder.py that runs an autoencoder to reconstruct CDR3 sequences that are already feature engineered
  • UL_HMM.py
    • Contains a base class for a standard HMM written by Michael Hamilton (hamiltom@cs.colostate.edu) with some custom edits. It was never used for classification due to time constraints
  • UL_ProtVec.py
    • Contains the code for training a Skip-Gram GloVe custom embedding on the data set. The data retrieval methods need to be updated
  • UL_rnnModel.py
    • Contains the base class for a Recurrent Neural Network model that is used in several scripts to classify an input numeric sequence as either CD4 or CD8
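As an illustration of the dictionary-based lookup that UL_atchFactors.py provides, here is a hypothetical minimal sketch. The names `atchley_factors` and `vectorize` are assumptions, only two residues are shown, and the values should be checked against the published Atchley et al. table:

```python
# Sketch of a dictionary-based Atchley factor lookup (not the actual
# UL_atchFactors.py code). Factor values are illustrative; verify them
# against the published Atchley et al. (2005) table before use.
ATCHLEY = {
    "A": (-0.59, -1.30, -0.73, 1.57, -0.15),
    "C": (-1.34, 0.47, -0.86, -1.02, -0.26),
    # ... remaining 18 amino acids
}

def atchley_factors(aa):
    """Return the five Atchley factors for a single amino-acid letter."""
    try:
        return ATCHLEY[aa.upper()]
    except KeyError:
        raise ValueError(f"Unknown amino acid: {aa!r}")

def vectorize(seq):
    """Flatten a CDR3 sequence into a 5 * len(seq) feature vector."""
    return [f for aa in seq for f in atchley_factors(aa)]
```

This flattened 5-factors-per-residue vector is the same shape of representation the Li et al. style feature engineering in the ML scripts builds on.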

Statistics Based Code (ST)

More accurately, these scripts contain a variety of data exploration experiments. The results fall under the data exploration section of the thesis results.

  • ST_AAposition.py
    • Calculates the frequentist probability of each amino acid being used at each position of a fixed-length sequence. It then produces heat maps of this usage and a heat map of the difference between the classes.
  • ST_CMV.py
    • This generates a t-SNE plot of Atchley-vectorized length-13 sequences along with known CMV CDR3 sequences.
  • ST_General.py
    • Gets general statistics on the dataset, like histograms of sequence length and the number of sequences shared between the classes.
  • ST_pTuple.py
    • Calculates the most common sub-sequences (p-Tuples) within CDR3s. It repeats this with CDR3s that have been clipped by a small number of amino acids (e.g. 4) from both the N- and C-termini.
  • ST_VJ.py
    • Generates histograms of V and J gene usage and produces t-SNE diagrams of all CDR3s and of the CDR3s from the most common V-J combination. It also classifies the sequences using just the V gene index, just the J gene index, and both indices as features.
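The positional-probability calculation in ST_AAposition.py can be sketched roughly as follows. This is a hypothetical reconstruction, not the script's actual code; the function name `positional_freqs` is an assumption:

```python
# Sketch of a frequentist positional amino-acid probability table
# (rows: amino acids, columns: positions) for fixed-length sequences.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def positional_freqs(seqs, length):
    """P(amino acid | position) over all sequences of the given length."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = np.zeros((len(AMINO_ACIDS), length))
    n = 0
    for s in seqs:
        if len(s) != length:
            continue  # only sequences of the set length contribute
        n += 1
        for pos, aa in enumerate(s):
            counts[idx[aa], pos] += 1
    return counts / max(n, 1)
```

A per-class usage heat map is then just this matrix plotted for each class, and the difference heat map the element-wise subtraction of the two matrices (e.g. `positional_freqs(cd4, 13) - positional_freqs(cd8, 13)`).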
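The p-Tuple extraction described for ST_pTuple.py (and reused for feature engineering in ML_Ptuple.py) amounts to counting overlapping fixed-length sub-sequences. A minimal sketch, with illustrative CDR3 strings and assumed helper names:

```python
# Sketch of p-Tuple counting: overlapping length-p sub-sequences of CDR3s.
# Sequences below are illustrative examples, not data from the project.
from collections import Counter

def p_tuples(seq, p):
    """All overlapping length-p sub-sequences (p-Tuples) of a sequence."""
    return [seq[i:i + p] for i in range(len(seq) - p + 1)]

def clip(seq, n):
    """Drop n residues from both the N- and C-terminal ends."""
    return seq[n:len(seq) - n]

seqs = ["CASSLGQAYEQYF", "CASSLAPGATNEKLFF"]
counts = Counter(t for s in seqs for t in p_tuples(s, 3))
print(counts.most_common(3))  # the shared CASS... prefix dominates
```

Repeating the count on `clip(s, 4)` removes the highly conserved residues at both termini, which is the clipped variant of the experiment the script describes.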

Machine Learning Scripts (ML)

These scripts all contain machine learning approaches to classification; in some cases it is just feature reduction. All classification scripts contain some classifiers that are commented out; comment them back in to run them. They also contain switches to include or exclude CDR1/2 and the V gene as features.

  • ML_AAprob.py
    • This uses a custom function to calculate the positional probability for both classes of CDR3 (and CDR1/2) and uses this to feature engineer and classify the sequences.
  • ML_AutoEncoder.py
    • This script runs an AutoEncoder to reconstruct already feature-engineered CDR3 sequences. It uses the base class from UL_autoencoderModel.py
  • ML_BayesFeatReduc.py
    • This is an almost-complete script that uses a 1-D Gaussian Naive Bayes classifier to decide the value of each potential tuple in p-Tuple feature engineering towards classification. This is based on the work of Mattia Cinelli.
  • ML_DataEfficiency.py
    • Plots learning curves using the Li et al. method and an XGBoost classifier, run 100 times, to analyze the amount of data needed to achieve the best classification results observed.
  • ML_EnsembleNet.py
    • Creates a feed-forward neural network classifier that has separate layers bottlenecking to a 2D softmax layer for each CDR and the V-J genes. These values are then fed into a shared layer that continues on to produce a final prediction. Works reasonably well.
  • ML_HMM.py
    • Incomplete attempt to train two HMMs using the hmmlearn library, which would then be used for classification. Does not work due to a bug preventing the input from working correctly with the HMM class. Needs future work.
  • ML_Li_AA.py
    • This is a classification script that uses both the positional probability and Li et al. feature engineering methods concatenated together.
  • ML_Li.py
    • This contains the most successful classification, which uses the Li et al. Atchley vector based feature engineering approach combined with an XGBoost classifier.
  • ML_MetricsFeatReduc.py
    • This produces plots and values of the importance of features in Atchley vectors for CDRs using Mutual Information and ANOVA F-value.
  • ML_ProtVec.py
    • Runs classification on CDRs feature engineered using protein embeddings.
  • ML_Ptuple.py
    • Runs classification on CDRs feature engineered using p-Tuple frequency vectors.
  • ML_ReconAutoEncoder.py
    • Runs an AutoEncoder to reconstruct CDR sequences that are one-hot encoded. Then uses an Adaboost classifier to classify the sequences based on their encoded latent representation.
  • ML_VarAutoEncoder.py
    • This contains a version of ML_ReconAutoEncoder.py that has instead been built as a Variational AutoEncoder. It works just as well but can be sampled from. Interesting for future investigation.
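To make the overall shape of the most successful pipeline (Atchley-style feature engineering followed by gradient boosting, as in ML_Li.py) concrete, here is a hypothetical end-to-end sketch. The sequences and labels are synthetic, the factor table is a cut-down stand-in with illustrative values, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost:

```python
# Hypothetical sketch of the ML_Li.py-style pipeline, NOT the project's code:
# vectorize each residue, flatten per sequence, then fit a boosted classifier.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Cut-down stand-in factor table: two factors per residue instead of five,
# covering only the residues used in the synthetic data below.
FACTORS = {"A": (-0.59, -1.30), "C": (-1.34, 0.47), "G": (-0.38, 1.65)}

def featurize(seq):
    """Flatten a fixed-length sequence into a numeric feature vector."""
    return [f for aa in seq for f in FACTORS[aa]]

rng = np.random.default_rng(0)
# Synthetic fixed-length "CDR3" sequences for two toy classes (CD4 vs CD8).
cd4 = ["".join(rng.choice(list("AAG"), 6)) for _ in range(100)]
cd8 = ["".join(rng.choice(list("CCG"), 6)) for _ in range(100)]
X = np.array([featurize(s) for s in cd4 + cd8])
y = np.array([0] * 100 + [1] * 100)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))  # near-perfect on this deliberately separable toy data
```

In the real scripts the feature vectors would come from the full five-factor Atchley table (see UL_atchFactors.py), the labels from the CD4/CD8 dataset, and the model would be evaluated on a held-out split rather than training accuracy.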