A large language model for predicting neurotoxic peptides and neurotoxins.

Author information

Rathore Anand Singh, Jain Saloni, Choudhury Shubham, Raghava Gajendra P S

Affiliation

Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

Publication information

Protein Sci. 2025 Aug;34(8):e70200. doi: 10.1002/pro.70200.

Abstract

The accurate prediction of neurotoxicity in peptides and proteins is essential for the safety evaluation of therapeutic proteins and genetically modified (GM) organisms. Existing tools, including our earlier method NTxPred, typically use a single predictive model for both neurotoxic peptides and proteins, despite their structural and functional differences. This lack of specialization may lead to suboptimal performance and limited generalizability. To address this, we developed NTxPred2, a set of distinct, specialized models for predicting neurotoxic peptides and neurotoxins (proteins). Our curated datasets include 877 neurotoxic and 877 non-toxic peptides, and 775 neurotoxic and 775 non-toxic proteins. Certain residues, such as cysteine, are prevalent in both datasets, but to different degrees. Using composition and binary profiles, our machine-learning models achieved an area under the curve (AUC) of 0.97 for peptides and 0.85 for proteins, with the latter improving to 0.89 when evolutionary information was added. Models using protein embeddings reached 0.96 AUC for peptides and 0.94 for proteins, while protein language models achieved 0.98 (esm2-t30) and 0.91 (esm2-t6). All models were validated via five-fold cross-validation, and the final models were evaluated on an independent dataset. We further assessed protein models on the peptide dataset and vice versa, highlighting the necessity of separate models. The proposed models outperform existing methods on independent datasets that were not used for training. Our neurotoxicity prediction models will aid in the safety assessment of GM foods and therapeutic proteins by minimizing the need for animal testing. To support the scientific community, we developed a standalone software package and a web server, NTxPred2, for predicting and scanning neurotoxins (https://webs.iiitd.edu.in/raghava/ntxpred2/, https://github.com/raghavagps/ntxpred2/).
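
The abstract mentions composition-based features without defining them. As a rough illustration only, the sketch below computes a percentage amino acid composition vector for one sequence, which is one common way to build such features. It is not taken from the NTxPred2 code: the function name `amino_acid_composition`, the constant `AMINO_ACIDS`, and the example sequence are all hypothetical, and the authors' actual feature definitions may differ.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue in a sequence.

    Hypothetical helper for illustration; non-standard characters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Counter returns 0 for residues that never occur, so every key is present.
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Made-up, cysteine-rich toy sequence (not a real peptide from the dataset).
example = "CCAKCGCA"
composition = amino_acid_composition(example)
print(composition["C"])
```

A fixed-length vector of this kind (20 values per sequence) can be passed to a standard classifier regardless of sequence length, which is presumably part of why composition features suit both short peptides and longer proteins.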
