Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland.
Ardigen, Krakow, Poland.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad468.
The advent of T-cell receptor (TCR) sequencing experiments has significantly increased the amount of available peptide:TCR binding data, and a number of machine-learning models have appeared in recent years. High-quality prediction models for a fixed epitope sequence are feasible, provided enough known binding TCR sequences are available. However, the performance of these models drops significantly for previously unseen peptides.
We prepare a dataset of known peptide:TCR binders and augment it with negative decoys created using healthy donors' T-cell repertoires. We employ deep learning methods commonly applied in Natural Language Processing to train a peptide:TCR binding model with a degree of cross-peptide generalization (0.69 AUROC). We demonstrate that BERTrand outperforms the published methods when evaluated on peptide sequences not used during model training.
The datasets and the code for model training are available at https://github.com/SFGLab/bertrand.
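The evaluation setup described above (negative decoys drawn from a background repertoire, testing on peptides withheld from training, scoring by AUROC) can be illustrated with a minimal sketch. This is not the BERTrand pipeline and does not reproduce the repository's code: the function names `make_decoys`, `split_by_peptide`, and `auroc`, the decoy ratio, and the test fraction are all hypothetical choices made for illustration, and the model itself is omitted.

```python
import random


def make_decoys(binders, repertoire, ratio=1, seed=0):
    """Pair each binder's peptide with TCRs sampled from a background
    repertoire (assumption: such random pairs are treated as non-binders)."""
    rng = random.Random(seed)
    known = set(binders)
    decoys = []
    for peptide, _tcr in binders:
        for _ in range(ratio):
            tcr = rng.choice(repertoire)
            # Skip the rare case where the sampled pair is a known binder.
            if (peptide, tcr) not in known:
                decoys.append((peptide, tcr))
    return decoys


def split_by_peptide(pairs, test_fraction=0.25, seed=0):
    """Hold out whole peptides so that no test peptide occurs in training."""
    rng = random.Random(seed)
    peptides = sorted({peptide for peptide, _ in pairs})
    rng.shuffle(peptides)
    n_test = max(1, int(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])
    train = [pair for pair in pairs if pair[0] not in test_peptides]
    test = [pair for pair in pairs if pair[0] in test_peptides]
    return train, test


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With a peptide-level split, the AUROC computed on the held-out pairs measures generalization to unseen peptides rather than to unseen TCRs for already-seen peptides, which is the harder setting the abstract refers to.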