Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
Program in Applied Mathematics, Yale University, New Haven, CT, USA.
Nucleic Acids Res. 2024 Jan 25;52(2):548-557. doi: 10.1093/nar/gkad1128.
High-throughput sequencing of B cell receptors (BCRs) is increasingly applied to study the immense diversity of antibodies. Learning biologically meaningful embeddings of BCR sequences is beneficial for predictive modeling. Several embedding methods have been developed for BCRs, but no direct performance benchmarking exists. Moreover, the impact of input sequence length and paired-chain information on prediction remains to be explored. We evaluated the performance of multiple embedding models in predicting BCR sequence properties and receptor specificity. Despite the differences in model architectures, most embeddings effectively capture BCR sequence properties and specificity. BCR-specific embeddings slightly outperform general protein language models in predicting specificity. In addition, incorporating full-length heavy chains and paired light chain sequences improves the prediction performance of all embeddings. This study provides insights into the properties of BCR embeddings to improve downstream prediction applications for antibody analysis and discovery.