Tomer Ritu, Jain Shipra, Gahlot Pushpendra Singh, Bajiya Nisha, Raghava Gajendra P S
Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Front Immunol. 2025 Sep 1;16:1630863. doi: 10.3389/fimmu.2025.1630863. eCollection 2025.
Rheumatoid arthritis (RA) is an autoimmune disorder in which the immune system mounts an abnormal response to self-antigens, resulting in chronic inflammation and joint damage. Identifying antigenic regions in proteins that trigger RA is essential for the development of protein-based therapeutics.
We developed models to predict HLA class II-binding, RA-inducing peptides using a dataset of 291 experimentally validated RA-inducing peptides and 165 RA non-inducing peptides. Positional and compositional analyses were performed to identify residue preferences. Alignment-based approaches (BLAST and MERCI), machine learning classifiers, deep learning models, and protein language model-based methods were evaluated for predictive performance.
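The compositional analysis mentioned above can be illustrated with a minimal sketch. It assumes that composition means the percentage of each of the 20 standard amino acids in a peptide, averaged over a set of peptides. The function names and the example peptides are hypothetical and are not taken from the study or from RAIpred.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide):
    """Return the percent composition of each standard residue in one peptide."""
    counts = Counter(res for res in peptide.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("peptide contains no standard residues")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(peptides):
    """Average the per-peptide percent composition over a list of peptides."""
    comps = [amino_acid_composition(p) for p in peptides]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical toy peptides, for illustration only.
    inducing = ["GPYGGA", "PGYKLG"]
    non_inducing = ["AKLLEV", "DEKLSA"]
    pos = mean_composition(inducing)
    neg = mean_composition(non_inducing)
    for aa in "GPY":
        print(aa, round(pos[aa], 1), round(neg[aa], 1))
```

Comparing the two mean profiles residue by residue shows which amino acids are over-represented in one class. Judging whether a difference is significant would additionally require a statistical test, which this sketch omits.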
Compositional analysis revealed significant enrichment of glycine, proline, and tyrosine in RA-inducing peptides. Alignment-based approaches provided high precision but limited coverage. Among machine learning methods, XGBoost achieved the best performance (AUC = 0.75) on the validation dataset, while ProtBERT was the top-performing protein language model (AUC = 0.72). The ensemble model integrating XGBoost with MERCI-derived motifs yielded the highest overall performance (AUC = 0.80; MCC = 0.45) on an independent validation dataset.
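One way a classifier score and motif hits might be combined, and how MCC is computed from a confusion matrix, is sketched below. The abstract does not specify the combination rule, so the additive adjustment, the `weight` value, the plain substring matching (MERCI motifs can also contain gaps and residue classes), and the example motifs are all assumptions made for illustration.

```python
import math


def motif_adjustment(peptide, positive_motifs, negative_motifs, weight=0.5):
    """Return +weight for a positive-motif hit, -weight for a negative-motif hit, else 0.

    Assumption: motifs are matched as plain substrings, and a positive hit
    takes precedence over a negative one.
    """
    if any(m in peptide for m in positive_motifs):
        return weight
    if any(m in peptide for m in negative_motifs):
        return -weight
    return 0.0


def hybrid_score(ml_score, peptide, positive_motifs, negative_motifs, weight=0.5):
    """Add the motif adjustment to a classifier probability and clip to [0, 1]."""
    score = ml_score + motif_adjustment(peptide, positive_motifs, negative_motifs, weight)
    return min(1.0, max(0.0, score))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical motifs and classifier score, for illustration only.
    print(hybrid_score(0.4, "AAGPYAA", ["GPY"], ["KKK"]))
    print(mcc(tp=40, tn=30, fp=10, fn=20))
```

Under this scheme a motif hit can move a borderline prediction across the decision threshold, while peptides without any motif hit keep their classifier score unchanged.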
This study presents computational strategies for identifying RA-inducing peptides and demonstrates the advantage of combining motif-based and machine learning approaches for improved performance. The findings are valuable for evaluating the safety of proteins in probiotics, genetically modified foods, and protein-based therapeutics. To facilitate broader use, the best-performing approach has been implemented in RAIpred, a web server and standalone software tool for predicting and scanning RA-inducing peptides, available at https://webs.iiitd.edu.in/raghava/raipred/.