TCR2vec is a Python package for embedding TCR sequences into numerical vectors. It is a transformer-based model pretrained with MLM (masked language modeling) and SPM (similarity preservation modeling). After this multi-task pretraining stage, TCR2vec transforms the amino acid sequences of TCRs into a similarity-preserving embedding space with a contextual understanding of the language of TCRs: TCRs that are similar in sequence space have smaller Euclidean distances in vector space, while divergent TCRs have larger ones. The workflow of the pretraining process is shown below. TCR2vec can also be finetuned for better performance on task-specific data.
TCR2vec is written in Python and built on the deep learning library PyTorch. Compared to TensorFlow, PyTorch is more user-friendly in terms of version compatibility, and I would strongly suggest using PyTorch as the deep learning library so that followers can easily run the code without the pain of getting TensorFlow to work.
The required software dependencies are listed below:
tqdm
scipy
biopython
matplotlib
torch >= 1.1.0 (tested on 1.8.0)
pandas
numpy
sklearn
tape_proteins
To install TCR2vec from source, run the following inside the cloned repository:
cd TCR2vec
pip install .
Or you can directly install it as a PyPI package via
pip install tcr2vec
All the source data used in the paper is publicly available, so we suggest readers refer to the original papers for more details. We have also uploaded the processed data to Google Drive, which can be accessed via this link. For the pretraining data, please refer to the training repository.
We provide a simple code snippet below showing how to use TCR2vec to embed TCRs:
import torch
from tcr2vec.model import TCR2vec
from tape import TAPETokenizer
path_to_TCR2vec = 'path_to_pretrained_TCR2vec'
emb_model = TCR2vec(path_to_TCR2vec)
tokenizer = TAPETokenizer(vocab='iupac')
#by default, the device for emb_model is cpu
#emb_model = emb_model.to('cuda:0') #to gpu
#example TCR
seq = 'NAGVTQTPKFQVLKTGQSMTLQCAQDMNHNSMYWYRQDPGMGLRLIYYSASEGTTDKGEVPNGYNVSRLNKREFSLRLESAAPSQTSVYFCASSEALGTGNTIYFGEGSWLTVV'
token_ids = torch.tensor([tokenizer.encode(seq)])
output = emb_model(token_ids) # shape of 1 x 120
#convert to numpy array
emb = output.detach().cpu().numpy()
#for a batch input:
from tcr2vec.dataset import TCRLabeledDset
from torch.utils.data import DataLoader
from tcr2vec.utils import get_emb
dset = TCRLabeledDset([seq],only_tcr=True) #input a list of TCRs
loader = DataLoader(dset,batch_size=32,collate_fn=dset.collate_fn,shuffle=False)
emb = get_emb(emb_model,loader,detach=True) #B x emb_size
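The embeddings can be used to check the similarity-preservation property described in the overview: TCRs that are similar in sequence should lie closer together in Euclidean distance. Below is a minimal sketch of that check (not part of the package; it only assumes emb is the B x emb_size numpy array returned by get_emb above):
from scipy.spatial.distance import cdist
#pairwise Euclidean distances between all embedded TCRs
dist = cdist(emb, emb, metric='euclidean') #B x B
#smaller entries correspond to TCRs that are closer in the embedding space
print(dist)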
We also provide a Python script embed.py in tcr2vec/ that uses the pretrained model to embed the user's input file. The input file should be a csv file with one column recording the input TCRs (by default, the column name is full_seq).
python embed.py --pretrain_path path_to_tcr2vec --dset_path path_to_data.csv --save_path path_to_save_embedding.npy
Also, check python embed.py --h for more details about input parameters.
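As an illustration, a compatible input file can be prepared with pandas and the saved embeddings loaded back with numpy; the file names below are placeholders matching the command above, and seq is the example TCR from the snippet earlier:
import numpy as np
import pandas as pd
#write a csv with the default column name 'full_seq'
pd.DataFrame({'full_seq': [seq]}).to_csv('path_to_data.csv', index=False)
#after running embed.py, load the saved embeddings
emb = np.load('path_to_save_embedding.npy') #number_of_TCRs x emb_size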
To evaluate the quality of the embeddings on downstream classification, the basic script is shown below:
python evaluate.py --dset_folder path_to_5fold_dir --pretrain_path path_to_TCR2vec --c_method SVM
For more experiment settings, please enter python evaluate.py --h for details.
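Conceptually, this evaluation trains a classical classifier on top of the frozen embeddings. The sketch below illustrates the idea with scikit-learn's SVM; it is a simplified stand-in for evaluate.py, and emb_train/emb_test and y_train/y_test are assumed to be embeddings (from get_emb) and binary binding labels prepared by the user:
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
#fit an SVM on the training embeddings
clf = SVC(probability=True)
clf.fit(emb_train, y_train)
#score the held-out embeddings
scores = clf.predict_proba(emb_test)[:, 1]
print('AUC:', roc_auc_score(y_test, scores))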
We provide the finetuning code for classification purposes. If you write your own custom finetuning code, make sure you set the model to training mode (model.train()).
python finetune.py --path_train path_to_train --path_test path_to_test --epoch 20 --batch_size 64 --pretrain_path path_to_TCR2vec --save_path finetune_path.pth
Again, type python finetune.py --h for details.
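For reference, a bare-bones custom finetuning loop could look like the sketch below. The classification head, optimizer, and loss here are illustrative assumptions rather than the exact setup in finetune.py; note the explicit model.train() call mentioned above:
import torch
import torch.nn as nn
from tape import TAPETokenizer
from tcr2vec.model import TCR2vec

class TCRClassifier(nn.Module):
    def __init__(self, path_to_TCR2vec, emb_size=120):
        super().__init__()
        self.encoder = TCR2vec(path_to_TCR2vec) #pretrained embedding model
        self.head = nn.Linear(emb_size, 1) #hypothetical binary classification head
    def forward(self, token_ids):
        emb = self.encoder(token_ids) #B x emb_size
        return self.head(emb).squeeze(-1) #B logits

model = TCRClassifier('path_to_pretrained_TCR2vec')
model.train() #set to training mode so the encoder weights are updated
tokenizer = TAPETokenizer(vocab='iupac')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.BCEWithLogitsLoss()

#toy batch: the example TCR from above with a dummy binding label
token_ids = torch.tensor([tokenizer.encode(seq)])
labels = torch.tensor([1.0])

optimizer.zero_grad()
loss = criterion(model(token_ids), labels)
loss.backward()
optimizer.step()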
We provide code to compute prediction scores for TCR-epitope binding using a trained model, obtained either from finetuning or from the SVM/MLP classifiers in evaluate.py.
python predict.py --dset_path path_to_file --save_prediction_path path_to_save.txt --model_path path_to_finetune.pth
Again, type python predict.py --h for details.
The following pretrained models are provided:
- TCR2vec (pretrained on TCRdb)
- TCR2vec_small (smaller network; less GPU memory needed; embedding size = 128)
- CDR3vec (pretrained on CDR3 sequences)
- CDR3vec_small (smaller network; less GPU memory needed; embedding size = 128)
The full-length TCR can be recovered from the CDR3 sequence plus the V/J genes. An example is shown below:
from tcr2vec.utils import cdr2full
samples = [['cdr3_1','v1','j1'], ['cdr3_2','v2','j2'], ...] #each entry is [CDR3, V gene, J gene]
full_seq = cdr2full(directory,samples,verbose=False,multi_process=True)
#directory: path to gene directory; e.g. tcr2vec/data/TCR_gene_segment_data
More information can be found in utils.py.
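The recovered full-length sequences can then be embedded directly. A short sketch, assuming cdr2full returns a list of full-length amino acid sequences and reusing the batch-embedding utilities and the pretrained emb_model shown earlier:
from torch.utils.data import DataLoader
from tcr2vec.dataset import TCRLabeledDset
from tcr2vec.utils import get_emb
dset = TCRLabeledDset(full_seq, only_tcr=True) #full_seq: recovered full-length TCRs
loader = DataLoader(dset, batch_size=32, collate_fn=dset.collate_fn, shuffle=False)
emb = get_emb(emb_model, loader, detach=True) #B x emb_size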
If you want to re-train TCR2vec on our provided pretraining data or your own custom data, please check the training code.
- By default, the column names for the CDR3, V/J genes, and full TCRs are CDR3.beta, V, J, and full_seq
- For embedding evaluation, we recommend using sklearnx to accelerate the sklearn models (by specifying --use_sklearnx True)
- Example scripts can be found under scripts/
Name: Yuepeng Jiang
Email: [email protected]/[email protected]/[email protected]
Note: For a quick reply, feel free to send me an email since I check email often. Otherwise, you may open an issue in this repository.
Free use of TCR2vec is granted under the terms of the GNU General Public License version 3 (GPLv3).