This repository is a modified version of Rudy C. Yuen’s MEng dissertation project, originally developed at UCL to explore language model-based TCR embeddings for cancer classification.
The current version was developed for a UCL BSc Computer Science dissertation, focused on investigating the use of sparse attention-based Multiple Instance Learning (MIL) and pretrained TCR embedding models for cancer classification.
Key extensions and contributions in this version include:
- A custom data preprocessing pipeline (`data-preprocessing/`) for manually supplied alpha and beta chain files.
- Split-by-chain training (alpha and beta chains processed independently).
- Three new interpretability experiments on the SCEPTR model (Section 6.2 of the dissertation).
- Updated results and comparative evaluations for symbolic vs subsymbolic encodings.
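In attention-based MIL, each patient is a bag of TCR embeddings: a scoring vector assigns each TCR an attention weight via a softmax, and the bag embedding is the attention-weighted sum of instances. A minimal, dependency-free sketch of that pooling step (the vectors, dimensions and helper name are illustrative; the real models learn these parameters in PyTorch):

```python
import math

def attention_pool(bag, w):
    """Pool a bag of instance embeddings into one bag embedding.

    bag: list of embedding vectors (lists of floats), one per TCR.
    w:   scoring vector; w . h_i gives the raw attention score of TCR i.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in bag]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]     # softmax over the instances in the bag
    dim = len(bag[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, bag)) for d in range(dim)]
    return pooled, attn

# Toy bag of three 2-dimensional TCR embeddings.
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [2.0, 0.0]                           # scores the first dimension highly
pooled, attn = attention_pool(bag, w)
```

The bag-level classifier then operates on `pooled`, while `attn` is what the interpretability experiments inspect to find high-weight TCRs.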
Warning
This code has been tested on Linux CentOS (the UCL CS Lab 105 computers). Although it should work on other operating systems, this is not guaranteed.
Important
We developed the code under Python 3.11, and requirements.txt was generated in that environment, so installing the requirements may fail on Python versions below 3.11.
- Download this repository.
- Create a Python virtual environment: `python3 -m venv $YOUR-VENV-NAME-HERE$`
- Activate your virtual environment, then install the requirements: run `python -m pip install -r scripts/requirements.txt` on Windows, or `python -m pip install -r scripts/requirements-linux.txt` on Linux (Ubuntu).
- Install SCEPTR: `python -m pip install sceptr`
Note
You should install your own version of PyTorch, matching your CUDA version, before installing the requirements. You may find instructions for installing PyTorch here.
Note
SCEPTR has been published officially here.
To process local TCR data files (manually provided), use the following scripts from the `data-preprocessing/` directory:

- `select_files_for_eval.py`: Randomly selects a subset of patients for evaluation, producing CSV files listing the selected cancer and control files separately for α and β chains.
- `move_eval_files.py`: Moves selected evaluation files into a dedicated subdirectory to separate training/test data from evaluation data.
- `convert_to_sceptr_alpha.py` and `convert_to_sceptr_beta.py`: Preprocess and clean α/β chain files into the SCEPTR-compatible format. This includes:
  - Filtering for valid V/J gene calls and non-empty CDR3 sequences.
  - Enforcing chain-specific functionality via `tidytcells`.
  - Outputting a 6-column TSV file compatible with SCEPTR (`TRAV`, `TRAJ`, `CDR3A`, `TRBV`, `TRBJ`, `CDR3B`).
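The alpha-chain conversion step can be sketched as below. This is illustrative only: the input column names (`v_call`, `j_call`, `cdr3`) are assumed AIRR-style names, the gene-call check is a crude prefix test, and the real scripts validate gene symbols properly with `tidytcells`:

```python
import csv
import io

SCEPTR_COLUMNS = ["TRAV", "TRAJ", "CDR3A", "TRBV", "TRBJ", "CDR3B"]

def convert_alpha_rows(rows):
    """Keep alpha-chain rows with a V call, a J call and a non-empty CDR3,
    and map them onto the 6-column SCEPTR layout (beta columns left blank)."""
    out = []
    for row in rows:
        v, j, cdr3 = row.get("v_call", ""), row.get("j_call", ""), row.get("cdr3", "")
        # Crude validity check; the real pipeline uses tidytcells here.
        if not (v.startswith("TRAV") and j.startswith("TRAJ") and cdr3):
            continue
        out.append({"TRAV": v, "TRAJ": j, "CDR3A": cdr3,
                    "TRBV": "", "TRBJ": "", "CDR3B": ""})
    return out

def write_sceptr_tsv(rows, fh):
    """Write the cleaned rows as a SCEPTR-compatible 6-column TSV."""
    writer = csv.DictWriter(fh, fieldnames=SCEPTR_COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

raw = [
    {"v_call": "TRAV12-1*01", "j_call": "TRAJ33*01", "cdr3": "CAVDSNYQLIW"},
    {"v_call": "TRAV8-4*01", "j_call": "TRAJ15*01", "cdr3": ""},  # dropped: empty CDR3
]
buf = io.StringIO()
write_sceptr_tsv(convert_alpha_rows(raw), buf)
```

The beta-chain script would do the same with the `TRBV`/`TRBJ`/`CDR3B` columns populated instead.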
Additionally, to compress the data (i.e. remove all data other than the V call, J call and CDR3 sequences), you may run
python utils/file-compressor.py
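A minimal sketch of that compression step, assuming AIRR-style column names (`v_call`, `j_call`, `cdr3`); the real `utils/file-compressor.py` may name its columns and files differently:

```python
import csv
import io

KEEP = ["v_call", "j_call", "cdr3"]   # hypothetical column names

def compress(in_fh, out_fh):
    """Copy a TSV, keeping only the V call, J call and CDR3 columns."""
    reader = csv.DictReader(in_fh, delimiter="\t")
    writer = csv.DictWriter(out_fh, fieldnames=KEEP, delimiter="\t")
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row.get(k, "") for k in KEEP})

src = ("v_call\tj_call\tcdr3\tduplicate_count\n"
       "TRBV9*01\tTRBJ2-7*01\tCASSVGQGYEQYF\t42\n")
out = io.StringIO()
compress(io.StringIO(src), out)
```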
The evaluation and training/test patient files are listed in the data-preprocessing/filenames/ directory.
Each subdirectory contains `.csv` files specifying the filenames for each set.
Evaluation set filenames were randomly selected using select_files_for_eval.py.
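The random hold-out can be sketched as follows; the function name, file naming and seed are illustrative, not taken from `select_files_for_eval.py`:

```python
import random

def select_eval_patients(cancer_files, control_files, n_eval, seed=42):
    """Randomly hold out n_eval patients per class for evaluation,
    returning (eval_cancer, eval_control, train_cancer, train_control)."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    eval_cancer = sorted(rng.sample(cancer_files, n_eval))
    eval_control = sorted(rng.sample(control_files, n_eval))
    train_cancer = [f for f in cancer_files if f not in eval_cancer]
    train_control = [f for f in control_files if f not in eval_control]
    return eval_cancer, eval_control, train_cancer, train_control

cancer = [f"cancer_{i:03d}.tsv" for i in range(20)]
control = [f"control_{i:03d}.tsv" for i in range(20)]
ec, eo, tc, to_ = select_eval_patients(cancer, control, n_eval=5)
```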
To download the two variants of TCR-BERT, you may run the following command:
python loaders/load_tcrbert.py -o model
Please refer to this link for installation instructions for SCEPTR.
There are three training scripts: trainer-sceptr.py trains a classifier that uses SCEPTR to encode TCRs, trainer-symbolic.py trains a classifier that takes TCRs encoded by physico-chemical properties, and trainer-tcrbert.py trains a classifier that uses TCR-BERT to encode TCRs.
Each of these scripts takes a configuration file, which can be generated by
python trainer.py --make --end
after replacing trainer.py with the appropriate training script. If you want to run training with the default settings, you can run the following command instead.
python trainer.py --make
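The `--make`/`--end`/`-c` workflow described above can be sketched with `argparse`. The flag behaviour follows the text, but the config fields, defaults, and the `--log-file` default here are hypothetical, not the scripts' real schema:

```python
import argparse
import json
import pathlib

DEFAULT_CONFIG = {        # field names are illustrative, not the real schema
    "epochs": 50,
    "batch-size": 16,
    "learning-rate": 1e-4,
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Training entry point (sketch).")
    parser.add_argument("--make", action="store_true",
                        help="write a default config.json")
    parser.add_argument("--end", action="store_true",
                        help="exit after writing the config instead of training")
    parser.add_argument("-c", "--config", default="config.json",
                        help="configuration file to train with")
    parser.add_argument("--log-file", default="trainer.log",
                        help="name of the training log file")
    return parser.parse_args(argv)

def load_or_make(args):
    """Return the parsed config, or None when --make --end just writes it."""
    path = pathlib.Path(args.config)
    if args.make:
        path.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
        if args.end:                  # --make --end: generate config and stop
            return None
    return json.loads(path.read_text())

args = parse_args(["--make", "--end", "-c", "demo-config.json"])
config = load_or_make(args)
```

With `--make` alone, the sketch falls through to loading the freshly written defaults, matching the "train with default settings" behaviour above.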
All scripts generate a log file for the training process. You may change the log file's name with the following command.
python trainer.py --log-file custom-filename.log
To modify the training configurations, you may modify the config.json generated from the command above. The configurations available for each of the three training scripts are different. You may find the descriptions of each field for each training script below:

- trainer-sceptr.py: Descriptions Here
- trainer-tcrbert.py: Descriptions Here
- trainer-symbolic.py: Descriptions Here
To specify which configuration file to run, you may use the following command:
python trainer.py -c custom-configs.json
Tip
When you run multiple training instances and would like to check the progress of each, you can run the following command.
python for-ssh/checkdone.py
It also reports how long each training instance has been stale. It is recommended that you check on a training instance if it has been stale for over 2 hours.
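A staleness check in the spirit of `checkdone.py` can be sketched from the log file's modification time (the real script's logic and output format may differ):

```python
import os
import time

def stale_for(log_path):
    """Seconds since the training log was last written to."""
    return time.time() - os.path.getmtime(log_path)

def report(log_paths, stale_threshold=2 * 60 * 60):
    """Print each instance's stale time, flagging any over the threshold."""
    for path in log_paths:
        seconds = stale_for(path)
        flag = "CHECK ME" if seconds > stale_threshold else "ok"
        print(f"{path}: stale for {seconds / 3600:.1f} h [{flag}]")

# Example: a freshly written log should not be flagged.
with open("demo.log", "w") as fh:
    fh.write("epoch 1\n")
report(["demo.log"])
```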
The original results from Rudy's MEng project are retained under results/ for reference and comparison. All new experiments performed as part of this BSc dissertation are stored under results-new-alpha/ and results-new-beta/.
To test a model's performance on the evaluation set, you may use the following command after amending the model's directory and the best-performing epoch.
python src/calculate_evals.py
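The AUC reported during evaluation is, conceptually, the Mann-Whitney statistic: the probability that a randomly chosen cancer patient scores above a randomly chosen control, with ties counting half. A dependency-free sketch (the actual script may compute it differently, e.g. via scikit-learn):

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: fraction of (positive, negative)
    pairs where the positive scores higher, ties counting 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # one positive ranked below a negative
# 8 of the 9 positive/negative pairs are ordered correctly -> AUC = 8/9
```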
This repository includes several Jupyter notebooks to analyse training behaviour and interpret model predictions. Key notebooks include:
- `training-stats-analysis.ipynb`: Generates the loss, accuracy and AUC graphs for one training instance.
- `training-stats-combined.ipynb`: Generates the loss, accuracy and AUC graphs for a series of training instances.
- `training-stats-combined-alpha-vs-beta.ipynb` (new): Extension of the above to directly compare $\alpha$ and $\beta$ models on the same plots.
- `eval-stats-combined.ipynb`: Generates the confusion matrix and AUC curves for the models trained under a 3-way split.
- `training-stats-tables.ipynb` (new): Produces tables summarising accuracy, loss, and AUC on training and test sets across all runs and embeddings.
- `sceptr-highweight-tcr-analysis.ipynb` (new): Identifies high-attention TCRs that recur across cancer patients in the evaluation set.
- `sceptr-vector-alignment.ipynb` (new): Computes cosine similarity and angle between scoring and classifying layer vectors, both within and across runs.
- `sceptr-umap-visualisation.ipynb` (new): Visualises patient-level bag embeddings in 2D using UMAP, showing class separation.
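The vector-alignment computation reduces to cosine similarity and the corresponding angle between two weight vectors; a minimal sketch (the vector values are illustrative):

```python
import math

def cosine_and_angle(u, v):
    """Cosine similarity and angle (in degrees) between two vectors,
    as used to compare scoring and classifying layer vectors across runs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = dot / norm
    # Clamp against floating-point drift before taking the arccosine.
    return cos, math.degrees(math.acos(max(-1.0, min(1.0, cos))))

cos, angle = cosine_and_angle([1.0, 0.0], [1.0, 1.0])  # 45 degrees apart
```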
Modifications and additions (c) 2025 Jan Pytel
This project builds upon the original work by RcwYuen and is used for academic research purposes. The modifications remain under the same MIT License.