TCR2vec is a Python package for embedding TCR sequences into numerical vectors. It is a transformer-based model that is pretrained with masked language modeling (MLM) and similarity preservation modeling (SPM). After this multi-task pretraining stage, TCR2vec can transform the amino acid sequences of TCRs into a similarity-preserving embedding space with a contextual understanding of the language of TCRs: TCRs that are similar in sequence space have smaller Euclidean distances in vector space, while divergent TCRs have larger Euclidean distances. The workflow of the pretraining process is shown below. TCR2vec can also be finetuned for better performance on task-specific data.
TCR2vec is written in Python and built on the deep learning library PyTorch. Compared to TensorFlow, PyTorch is more user-friendly in terms of version compatibility. I would strongly suggest using PyTorch as the deep learning library so that others can run the code without the pain of getting TensorFlow to work.
The required software dependencies are listed below:
torch >= 1.1.0 (tested on 1.8.0)
cd TCR2vec
pip install .
Or you can directly install it as a PyPI package via
pip install tcr2vec
All the source data used in the paper are publicly available, so we suggest that readers refer to the original papers for more details. We also uploaded the processed data to Google Drive, which can be accessed via this link. For the pretraining data, please refer to the training repository.
We provide a simple code snippet showing how to use TCR2vec to embed TCRs:
import torch
from tcr2vec.model import TCR2vec
from tape import TAPETokenizer
path_to_TCR2vec = 'path_to_pretrained_TCR2vec'
emb_model = TCR2vec(path_to_TCR2vec)
tokenizer = TAPETokenizer(vocab='iupac')
#by default, the device for emb_model is cpu
#emb_model = emb_model.to('cuda:0') #to gpu
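#example TCR (placeholder: replace with the amino acid sequence of a full-length TCR)
seq = 'YOUR_FULL_LENGTH_TCR_SEQUENCE'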
token_ids = torch.tensor([tokenizer.encode(seq)])
output = emb_model(token_ids) # shape of 1 x 120
#convert to numpy array
emb = output.detach().cpu().numpy()
#for a batch input:
from tcr2vec.dataset import TCRLabeledDset
from torch.utils.data import DataLoader
from tcr2vec.utils import get_emb
dset = TCRLabeledDset([seq],only_tcr=True) #input a list of TCRs
loader = DataLoader(dset,batch_size=32,collate_fn=dset.collate_fn,shuffle=False)
emb = get_emb(emb_model,loader,detach=True) #B x emb_size
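Because TCR2vec is trained so that similar TCRs lie close together in Euclidean distance, a natural next step is to compare embeddings by distance. The following is a minimal sketch, assuming the embeddings are available as a NumPy array of shape B x emb_size. A small random array stands in for real embeddings here, and the helper names `pairwise_euclidean` and `nearest_neighbors` are hypothetical and not part of tcr2vec.

```python
import numpy as np

def pairwise_euclidean(emb):
    """Return the B x B matrix of Euclidean distances between rows of emb."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_norms = np.sum(emb ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * emb @ emb.T
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.clip(sq_dists, 0.0, None))

def nearest_neighbors(emb, k=1):
    """Return, for each row, the indices of its k closest other rows."""
    dists = pairwise_euclidean(emb)
    np.fill_diagonal(dists, np.inf)  # exclude each embedding from its own neighbor list
    return np.argsort(dists, axis=1)[:, :k]

# Stand-in for real embeddings: 4 vectors of size 120 (random, for illustration only)
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 120))
emb[1] = emb[0] + 0.01  # make row 1 a near-duplicate of row 0

dists = pairwise_euclidean(emb)
neighbors = nearest_neighbors(emb, k=1)
```

With real embeddings, `neighbors[i]` would list the TCRs closest to the i-th input in the embedding space.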
We also provide a Python script in tcr2vec/ that uses the pretrained model to embed a user-supplied input file. The input file should be a CSV file with one column containing the input TCRs (by default, the column name is full_seq).
python --pretrain_path path_to_tcr2vec --dset_path path_to_data.csv --save_path path_to_save_embedding.npy
Also, check python --h for more details about input parameters.
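As an illustration of the expected input format, the sketch below writes a CSV file with a single full_seq column using Python's standard csv module, and then loads a saved .npy file back with NumPy. The file locations are temporary and hypothetical, the sequences are placeholder strings rather than real TCRs, and the assumption that the saved array has one row per input TCR is ours. An all-zero array is saved here only so that the loading step can run without the pretrained model.

```python
import csv
import os
import tempfile

import numpy as np

# Placeholder strings; replace with real full-length TCR amino acid sequences
tcrs = ["FULL_LENGTH_TCR_SEQ_1", "FULL_LENGTH_TCR_SEQ_2", "FULL_LENGTH_TCR_SEQ_3"]

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "path_to_data.csv")
npy_path = os.path.join(workdir, "path_to_save_embedding.npy")

# Write the input file: one header row, then one TCR per row in the full_seq column
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["full_seq"])
    writer.writeheader()
    for tcr in tcrs:
        writer.writerow({"full_seq": tcr})

# Stand-in for the embedding script's output (assumed layout: one row per TCR)
np.save(npy_path, np.zeros((len(tcrs), 120), dtype=np.float32))

# Read both files back so that each TCR can be matched to its embedding row
with open(csv_path, newline="") as handle:
    loaded_tcrs = [row["full_seq"] for row in csv.DictReader(handle)]
loaded_emb = np.load(npy_path)
```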
The basic script is shown below:
python --dset_folder path_to_5fold_dir --pretrain_path path_to_TCR2vec --c_method SVM
For more experiment settings, run python --h.
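The internals of the evaluation script are not shown here. As a rough sketch of the general idea of training a classifier on fixed embeddings, the following uses scikit-learn's SVC with 5-fold cross-validation on synthetic vectors. The data, labels, kernel, and hyperparameters are assumptions chosen for illustration and are not necessarily those used by the script.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings: two classes of 50 vectors of size 120,
# drawn around different centers so that they are easy to separate
n_per_class, emb_size = 50, 120
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_per_class, emb_size)),
    rng.normal(loc=1.0, scale=1.0, size=(n_per_class, emb_size)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Linear SVM evaluated with stratified 5-fold cross-validation
clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
```

With real data, `X` would hold the embeddings of the TCRs and `y` their task labels.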
We provide finetuning code for classification. If you write your own finetuning code, make sure you set the model to training mode (model.train()).
python --path_train path_to_train --path_test path_to_test --epoch 20 --batch_size 64 --pretrain_path path_to_TCR2vec --save_path finetune_path.pth
Again, type python --h for details.
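For custom finetuning code, the sketch below illustrates the train/eval switch in a generic PyTorch loop. A small stand-in network replaces the pretrained TCR2vec encoder, and the inputs are random feature vectors rather than tokenized TCRs. The class name `TinyClassifier`, the layer sizes, and the hyperparameters are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyClassifier(nn.Module):
    """Stand-in for an encoder plus classification head (not the TCR2vec architecture)."""

    def __init__(self, in_dim=120, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # behaves differently in train and eval mode
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in batch: 16 feature vectors with binary labels
x = torch.randn(16, 120)
y = torch.randint(0, 2, (16,))

model.train()  # enable dropout and other training-time behavior
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()  # disable dropout before making predictions
with torch.no_grad():
    logits_a = model(x)
    logits_b = model(x)
```

In eval mode the two forward passes give identical outputs, whereas in training mode dropout would make them differ. This is why the mode should be set explicitly before each phase.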
We provide the code to make prediction scores for TCR-epitope binding using the trained model from either finetuning or using SVM/MLP in
python --dset_path path_to_file --save_prediction_path path_to_save.txt --model_path path_to_finetune.pth
Again, type python --h for details.
TCR2vec_small (smaller network; less GPU memory needed; embedding size=128)
TCR2vec(pretrained on TCRdb)
CDR3vec (pretrained on CDR3 sequences)
CDR3vec_small (smaller network; less GPU memory needed; embedding size=128)
The full-length TCR can be recovered by knowing CDR3 + V/J. An example is shown below:
from tcr2vec.utils import cdr2full
directory = 'tcr2vec/data/TCR_gene_segment_data' #path to the gene directory (example path)
samples = [['cdr3_1','v1','j1'],['cdr3_2','v2','j2']] #one [CDR3, V gene, J gene] entry per TCR; add more as needed
full_seq = cdr2full(directory,samples,verbose=False,multi_process=True)
More information can be found on
If you want to re-train TCR2vec on our provided pretraining data or your custom data, please check the training code
- By default, the column names for CDR3, V/J genes, full TCRs are CDR3.beta, V, J, and full_seq
- For embedding evaluation, we recommend using sklearnx to accelerate the sklearn models (by specifying --use_sklearnx True)
- Example scripts can be found under scripts/
Name: Yuepeng Jiang
Email: [email protected]/[email protected]/[email protected]
Note: For urgent queries, feel free to send me an email, since I check email often. Otherwise, you may open an issue in this repository.
Free use of TCR2vec is granted under the terms of the GNU General Public License version 3 (GPLv3).