Sarlakifar Faezeh, Malek Hamed, Allahyari Fard Najaf
Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran.
Department of Systems Biotechnology, National Institute of Genetic Engineering and Biotechnology (NIGEB), Tehran, Iran.
Biol Methods Protoc. 2025 Jul 9;10(1):bpaf040. doi: 10.1093/biomethods/bpaf040. eCollection 2025.
Allergens are a major concern in determining protein safety, especially with the growing use of recombinant proteins in new medical products. These proteins require careful allergenicity assessment to ensure their safety. However, traditional laboratory tests for allergenicity are expensive and time-consuming. Bioinformatics offers efficient, cost-effective alternatives for predicting protein allergenicity, and deep learning models are a promising approach. With the recent emergence of protein language models (pLMs), high-quality feature vectors can be extracted directly from protein sequences. Although different computational methods can be effective individually, combining them could improve prediction results. Given this hypothesis, can we develop a more powerful approach than existing methods to predict protein allergenicity? In this study, we developed an enhanced deep learning model that predicts the potential allergenicity of proteins from their primary structure, represented as protein sequences. In simple terms, the model classifies protein sequences as allergenic or non-allergenic. Our approach uses two pLMs to extract distinct feature vectors for each sequence, which are then fed into a deep neural network (DNN) for classification. Combining these feature vectors improves the results. Finally, we integrated our top-performing models using ensemble modeling techniques, an approach that can balance the model's sensitivity and specificity. Our proposed model improves on existing models, achieving a sensitivity of 97.91%, a specificity of 97.69%, an accuracy of 97.80%, and an area under the receiver operating characteristic curve of 99% using standard 2-fold cross-validation.
The AllerTrans model has been deployed as a web-based prediction tool and is publicly accessible at: https://huggingface.co/spaces/sfaezella/AllerTrans.
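The pipeline described in the abstract (two pLM feature vectors per sequence, combined and classified, then several models ensembled and scored by sensitivity and specificity) can be illustrated with a minimal sketch. The abstract does not specify the pLMs, the DNN architecture, or the ensemble rule, so everything here is an assumption made for illustration: the embeddings are toy arrays, the per-model probabilities are hard-coded, and soft voting (averaging probabilities) stands in for the unspecified "ensemble modeling techniques". The function names are hypothetical and are not taken from the AllerTrans code.

```python
import numpy as np


def combine_features(emb_a, emb_b):
    """Concatenate per-sequence feature vectors from two pLMs along the feature axis."""
    return np.concatenate([emb_a, emb_b], axis=1)


def soft_vote(prob_list, threshold=0.5):
    """Average allergen probabilities from several models, then threshold.

    Soft voting is an assumed ensemble rule; the abstract does not name one.
    """
    mean_prob = np.mean(np.stack(prob_list, axis=0), axis=0)
    return (mean_prob >= threshold).astype(int)


def sensitivity_specificity_accuracy(y_true, y_pred):
    """Compute the three metrics for binary labels (1 = allergen, 0 = non-allergen)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    sensitivity = tp / (tp + fn)   # fraction of allergens correctly flagged
    specificity = tn / (tn + fp)   # fraction of non-allergens correctly cleared
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy


# Toy stand-ins for embeddings of 4 sequences from two hypothetical pLMs.
emb_a = np.zeros((4, 3))
emb_b = np.ones((4, 2))
features = combine_features(emb_a, emb_b)  # shape (4, 5), input to a classifier

# Hard-coded allergen probabilities from two hypothetical trained models.
probs_model_1 = np.array([0.9, 0.4, 0.2, 0.7])
probs_model_2 = np.array([0.8, 0.7, 0.1, 0.2])
y_pred = soft_vote([probs_model_1, probs_model_2])

y_true = np.array([1, 1, 0, 1])
sens, spec, acc = sensitivity_specificity_accuracy(y_true, y_pred)
print(features.shape, y_pred.tolist(), sens, spec, acc)
```

In this toy run, averaging lets the second model rescue a positive that the first model misses (sequence 2) but also pulls the fourth sequence below the threshold, which shows how the choice of ensemble members and threshold trades sensitivity against specificity.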