Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing.

Affiliations

University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam.

University of Economics and Business Administration, Thai Nguyen University, Thai Nguyen, Viet Nam.

Publication information

Methods. 2024 Dec;232:65-71. doi: 10.1016/j.ymeth.2024.10.006. Epub 2024 Oct 22.

Abstract

Protein ubiquitination is a critical post-translational modification (PTM) that is involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools rely mainly on predefined sequence features and machine learning algorithms, and species-specific variations in ubiquitination patterns remain poorly understood. This study introduces a novel approach for predicting Arabidopsis thaliana ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species "Teacher model" to guide a more compact, species-specific "Student model", with the "Teacher" generating pseudo-labels that enhance the "Student" model's learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3% and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3% and an AUC of 0.923. Comparative analysis with established predictors further highlights the model's superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: https://github.com/nuinvtnu/KD_ArapUbi.
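The teacher-student setup described in the abstract can be illustrated with a generic distillation objective. The sketch below is not taken from the paper or its repository: it assumes a two-class problem (ubiquitinated vs. non-ubiquitinated lysine), treats the teacher's temperature-softened output as the pseudo-label, and uses the common weighted sum of a hard-label term and a soft-label term. The function names, `alpha`, and `temperature` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Per-example loss for a student guided by a teacher (hypothetical helper).

    alpha weights the cross-entropy against the true label; (1 - alpha)
    weights the cross-entropy against the teacher's softened pseudo-label.
    The temperature**2 factor is the usual rescaling of the soft term.
    """
    # Hard-label term: standard cross-entropy with the annotated class.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: the teacher's softened distribution is the target.
    pseudo_label = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(pseudo_label, student_soft))

    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


# Class 1 = ubiquitinated site (assumed encoding). The teacher is confident
# in class 1, so a student that agrees should incur the smaller loss.
teacher = [-3.0, 3.0]
loss_agree = distillation_loss([-2.0, 2.0], teacher, label=1)
loss_disagree = distillation_loss([2.0, -2.0], teacher, label=1)
print(loss_agree < loss_disagree)
```

Setting `alpha=1.0` recovers ordinary supervised training, which makes it easy to compare a student trained with and without teacher guidance under this assumed formulation.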
