TCR2vec is a Python package for embedding TCR sequences into numerical vectors. It is a transformer-based model pretrained with MLM (masked language modeling) and SPM (similarity preservation modeling). After this multi-task pretraining stage, TCR2vec transforms the amino acid sequences of TCRs into a similarity-preserving embedding space with a contextual understanding of the language of TCRs: TCRs that are similar in sequence space have smaller Euclidean distances in vector space, while divergent TCRs have larger ones. The workflow of the pretraining process is shown below. TCR2vec can also be finetuned for better performance on task-specific data.
TCR2vec is written in Python and built on the deep learning library PyTorch. Compared to TensorFlow, PyTorch is more user-friendly in terms of version compatibility, and I would strongly suggest using PyTorch as the deep learning library so that followers can easily run the code without the pain of getting TensorFlow to work.
The required software dependencies are listed below:
tqdm
scipy
biopython
matplotlib
torch >= 1.1.0 (tested on 1.8.0)
pandas
numpy
sklearn
tape_proteins
To install TCR2vec from source, run the following inside the cloned repository:
cd TCR2vec
pip install .
Or you can directly install it as a PyPI package via
pip install tcr2vec
All the source data used in the paper is publicly available, so we suggest readers refer to the original papers for more details. We have also uploaded the processed data to Google Drive, which can be accessed via this link. For the pretraining data, please refer to the training repository.
We provide a simple code snippet below showing how to use TCR2vec to embed TCRs:
import torch
from tcr2vec.model import TCR2vec
from tape import TAPETokenizer
path_to_TCR2vec = 'path_to_pretrained_TCR2vec'
emb_model = TCR2vec(path_to_TCR2vec)
tokenizer = TAPETokenizer(vocab='iupac')
#by default, the device for emb_model is cpu
#emb_model = emb_model.to('cuda:0') #to gpu
#example TCR
seq = 'NAGVTQTPKFQVLKTGQSMTLQCAQDMNHNSMYWYRQDPGMGLRLIYYSASEGTTDKGEVPNGYNVSRLNKREFSLRLESAAPSQTSVYFCASSEALGTGNTIYFGEGSWLTVV'
token_ids = torch.tensor([tokenizer.encode(seq)])
output = emb_model(token_ids) # shape of 1 x 120
#convert to numpy array
emb = output.detach().cpu().numpy()
#for a batch input:
from tcr2vec.dataset import TCRLabeledDset
from torch.utils.data import DataLoader
from tcr2vec.utils import get_emb
dset = TCRLabeledDset([seq],only_tcr=True) #input a list of TCRs
loader = DataLoader(dset,batch_size=32,collate_fn=dset.collate_fn,shuffle=False)
emb = get_emb(emb_model,loader,detach=True) #B x emb_size
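The embeddings can be used to check the similarity-preservation property described in the overview: TCRs that are similar in sequence should lie closer together in Euclidean distance. Below is a minimal sketch of that check (not part of the package; it only assumes emb is the B x emb_size numpy array returned by get_emb above):
from scipy.spatial.distance import cdist
#pairwise Euclidean distances between all embedded TCRs
dist = cdist(emb, emb, metric='euclidean') #B x B
#smaller entries correspond to TCRs that are closer in the embedding space
print(dist)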
We also provide a Python script embed.py in tcr2vec/ that uses the pretrained model to embed the user's input file. The input file should be a csv file with one column recording the input TCRs (by default, the column name is full_seq).
python embed.py --pretrain_path path_to_tcr2vec --dset_path path_to_data.csv --save_path path_to_save_embedding.npy
Also, check python embed.py --h for more details about input parameters.
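As an illustration, a compatible input file can be prepared with pandas and the saved embeddings loaded back with numpy; the file names below are placeholders matching the command above, and seq is the example TCR from the snippet earlier:
import numpy as np
import pandas as pd
#write a csv with the default column name 'full_seq'
pd.DataFrame({'full_seq': [seq]}).to_csv('path_to_data.csv', index=False)
#after running embed.py, load the saved embeddings
emb = np.load('path_to_save_embedding.npy') #number_of_TCRs x emb_size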
To evaluate the quality of the embeddings on downstream classification, the basic script is shown below:
python evaluate.py --dset_folder path_to_5fold_dir --pretrain_path path_to_TCR2vec --c_method SVM
For more experiment settings, please enter python evaluate.py --h for details.
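Conceptually, this evaluation trains a classical classifier on top of the frozen embeddings. The sketch below illustrates the idea with scikit-learn's SVM; it is a simplified stand-in for evaluate.py, and emb_train/emb_test and y_train/y_test are assumed to be embeddings (from get_emb) and binary binding labels prepared by the user:
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
#fit an SVM on the training embeddings
clf = SVC(probability=True)
clf.fit(emb_train, y_train)
#score the held-out embeddings
scores = clf.predict_proba(emb_test)[:, 1]
print('AUC:', roc_auc_score(y_test, scores))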
We provide the finetuning code for classification purposes. If you write your own custom finetuning code, make sure you set the model to training mode (model.train()).
python finetune.py --path_train path_to_train --path_test path_to_test --epoch 20 --batch_size 64 --pretrain_path path_to_TCR2vec --save_path finetune_path.pth
Again, type python finetune.py --h for details.
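For reference, a bare-bones custom finetuning loop could look like the sketch below. The classification head, optimizer, and loss here are illustrative assumptions rather than the exact setup in finetune.py; note the explicit model.train() call mentioned above:
import torch
import torch.nn as nn
from tape import TAPETokenizer
from tcr2vec.model import TCR2vec

class TCRClassifier(nn.Module):
    def __init__(self, path_to_TCR2vec, emb_size=120):
        super().__init__()
        self.encoder = TCR2vec(path_to_TCR2vec) #pretrained embedding model
        self.head = nn.Linear(emb_size, 1) #hypothetical binary classification head
    def forward(self, token_ids):
        emb = self.encoder(token_ids) #B x emb_size
        return self.head(emb).squeeze(-1) #B logits

model = TCRClassifier('path_to_pretrained_TCR2vec')
model.train() #set to training mode so the encoder weights are updated
tokenizer = TAPETokenizer(vocab='iupac')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.BCEWithLogitsLoss()

#toy batch: the example TCR from above with a dummy binding label
token_ids = torch.tensor([tokenizer.encode(seq)])
labels = torch.tensor([1.0])

optimizer.zero_grad()
loss = criterion(model(token_ids), labels)
loss.backward()
optimizer.step()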
We provide code to compute prediction scores for TCR-epitope binding using a trained model, obtained either from finetuning or from the SVM/MLP classifiers in evaluate.py.
python predict.py --dset_path path_to_file --save_prediction_path path_to_save.txt --model_path path_to_finetune.pth
Again, type python predict.py --h for details.
The following pretrained models are provided:
- TCR2vec (pretrained on TCRdb)
- TCR2vec_small (smaller network; less GPU memory needed; embedding size = 128)
- CDR3vec (pretrained on CDR3 sequences)
- CDR3vec_small (smaller network; less GPU memory needed; embedding size = 128)
The full-length TCR can be recovered from the CDR3 sequence plus the V/J genes. An example is shown below:
from tcr2vec.utils import cdr2full
samples = [['cdr3_1','v1','j1'], ['cdr3_2','v2','j2'], ...] #each entry is [CDR3, V gene, J gene]
full_seq = cdr2full(directory,samples,verbose=False,multi_process=True)
#directory: path to gene directory; e.g. tcr2vec/data/TCR_gene_segment_data
More information can be found in utils.py.
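The recovered full-length sequences can then be embedded directly. A short sketch, assuming cdr2full returns a list of full-length amino acid sequences and reusing the batch-embedding utilities and the pretrained emb_model shown earlier:
from torch.utils.data import DataLoader
from tcr2vec.dataset import TCRLabeledDset
from tcr2vec.utils import get_emb
dset = TCRLabeledDset(full_seq, only_tcr=True) #full_seq: recovered full-length TCRs
loader = DataLoader(dset, batch_size=32, collate_fn=dset.collate_fn, shuffle=False)
emb = get_emb(emb_model, loader, detach=True) #B x emb_size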
If you want to re-train TCR2vec on our provided pretraining data or your own custom data, please check the training code.
- By default, the column names for the CDR3, V/J genes, and full TCRs are CDR3.beta, V, J, and full_seq
- For embedding evaluation, we recommend using sklearnx to accelerate the sklearn models (by specifying --use_sklearnx True)
- Example scripts can be found under scripts/
Name: Yuepeng Jiang
Email: [email protected]/[email protected]/[email protected]
Note: For a quick reply, feel free to send me an email since I check email often. Otherwise, you may open an issue in this repository.
Free use of TCR2vec is granted under the terms of the GNU General Public License version 3 (GPLv3).