Feng Xikang, Huo Miaozhe, Li He, Yang Yongze, Jiang Yuepeng, He Liang, Cheng Li Shuai
School of Software, Northwestern Polytechnical University, 127 West Youyi Road, Beilin District, Xi'an, Shaanxi, 710072, China.
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, 999077, China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf030.
The complexity of T cell receptor (TCR) sequences, particularly within the complementarity-determining region 3 (CDR3), requires efficient embedding methods for applying machine learning to immunology. Although various TCR CDR3 embedding strategies have been proposed, the lack of systematic evaluation has caused confusion in the community. Here, we extracted CDR3 embedding models from 19 existing methods and benchmarked them on four curated datasets by assessing their impact on the performance of TCR downstream tasks, including TCR-epitope binding affinity prediction, epitope-specific TCR identification, TCR clustering, and visualization analysis. We evaluated these models using eight downstream classifiers and five downstream clustering methods, measuring performance with a diverse range of metrics covering precision, robustness, and usability. Overall, handcrafted embeddings outperformed data-driven ones in modeling TCR-epitope interactions. To further refine our comparative findings, we developed an all-in-one TCR CDR3 embedding package comprising all evaluated embedding models. This package will help users easily select suitable embedding models for their data.