Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India.
Methods. 2024 Dec;232:18-28. doi: 10.1016/j.ymeth.2024.10.007. Epub 2024 Oct 19.
HLA-DRB1*04:01 is associated with numerous diseases, including sclerosis, arthritis, diabetes, and COVID-19, emphasizing the need to scan antigens for binders in order to develop immunotherapies and vaccines. Current prediction methods are often limited by their reliance on small datasets. This study presents HLA-DR4Pred2, developed on a large dataset containing 12,676 binders and an equal number of non-binders. It is an improved version of HLA-DR4Pred, which was trained on a small dataset containing 576 binders and an equal number of non-binders. All models were trained, optimized, and tested on 80% of the data using five-fold cross-validation and evaluated on the remaining 20%. A range of machine learning techniques was employed, achieving maximum AUROCs of 0.90 and 0.87 using composition and binary profile features, respectively. Combining the composition-based model with BLAST search increased its performance to 0.93. Additionally, models developed on a realistic dataset containing 12,676 binders and 86,300 non-binders achieved a maximum AUROC of 0.99. On the independent dataset, our best model outperformed existing methods. Finally, we developed a standalone tool and a web server for HLA-DR4Pred2, enabling the prediction, design, and virtual scanning of HLA-DRB1*04:01 binding peptides, and we also released a Python package on the Python Package Index (https://webs.iiitd.edu.in/raghava/hladr4pred2/; https://github.com/raghavagps/hladr4pred2; https://pypi.org/project/hladr4pred2/).
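The abstract names two feature types, composition and binary profile, without defining them. As a rough illustration only, the sketch below shows one common way such encodings are computed for a peptide: amino acid composition as the percentage of each of the 20 standard residues, and a binary profile as a flattened one-hot vector per position. The function names, the percentage scaling, the handling of non-standard residues, and the example peptide are all assumptions made for illustration; this is not the HLA-DR4Pred2 implementation or its package API.

```python
# Illustrative only: generic peptide encodings, not the HLA-DR4Pred2 code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide: str) -> list[float]:
    """Return the percentage of each standard residue in the peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    return [100.0 * peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def binary_profile(peptide: str) -> list[int]:
    """Return a flattened one-hot encoding, 20 values per position."""
    peptide = peptide.upper()
    profile: list[int] = []
    for residue in peptide:
        # Assumption: a non-standard residue (e.g. X) becomes an all-zero block.
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile


if __name__ == "__main__":
    example = "AAKL"  # hypothetical peptide, chosen only for illustration
    print(amino_acid_composition(example))
    print(binary_profile(example))
```

The composition vector always has 20 entries regardless of peptide length, whereas the binary profile grows by 20 entries per residue, so binary profiles generally require peptides of a fixed length (or padding) before they can be fed to a classifier.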