Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
Sci Rep. 2024 Nov 25;14(1):29141. doi: 10.1038/s41598-024-80940-y.
Efficient paratope prediction holds considerable potential for improving antibody design, treating cancers and other serious diseases, and advancing personalized medicine. Traditional methods are highly accurate, but they are often time-consuming, labor-intensive, and reliant on 3D structures, which restricts their broader use. Machine learning-based methods, in turn, not only rely on structural data but also require descriptor computation, consideration of diverse physicochemical properties, and feature engineering. Here, we develop a deep learning method for paratope identification that relies solely on amino acid sequences and is antigen-agnostic. Built on the ProtTrans architecture and using pre-trained protein and antibody language models, the method extracts embeddings for paratope prediction. Incorporating positional encoding for Complementarity Determining Regions (CDRs) gives the model a deeper structural understanding, and it achieves a ROC AUC of 0.904, an F1-score of 0.701, and an MCC of 0.585 on benchmark datasets. Beyond accurate antibody paratope prediction, the method also performs strongly on nanobody paratope prediction, achieving a ROC AUC of 0.912 and a PR AUC of 0.665 on the nanobody dataset. Notably, our approach outperforms structure-based prediction methods, with a PR AUC of 0.731. Ablation studies, which examine the contribution of each part of the model to the prediction task, show that the performance gain from combining CDR positional encoding with convolutional neural networks (CNNs) depends on the specific protein and antibody language models used. These results highlight the potential of our method to advance disease understanding and aid in the discovery of new diagnostics and antibody therapies.
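For readers less familiar with the threshold-based metrics quoted above, the following minimal sketch shows how an F1-score and a Matthews correlation coefficient (MCC) can be computed from per-residue binary labels (1 = paratope residue). It is an illustration only, not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and probabilities are all assumptions made for this example. ROC AUC and PR AUC are threshold-free and are not computed here.

```python
import math


def binarize(probs, threshold=0.5):
    """Turn per-residue probabilities into 0/1 calls (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 marks a paratope residue."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (hypothetical data): eight residues, three of them in the paratope.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
probs = [0.1, 0.6, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3]

tp, fp, tn, fn = confusion_counts(y_true, binarize(probs))
print(f"F1 = {f1_score(tp, fp, fn):.3f}, MCC = {mcc(tp, fp, tn, fn):.3f}")
```

Paratope residues are typically a small minority of an antibody sequence, which is one reason MCC and PR AUC are informative alongside ROC AUC: both are more sensitive to performance on the rare positive class.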