Enhanced prediction of protein functional identity through the integration of sequence and structural features.

Author information

Fujita Suguru, Terada Tohru

Affiliation

Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan.

Publication information

Comput Struct Biotechnol J. 2024 Nov 14;23:4124-4130. doi: 10.1016/j.csbj.2024.11.028. eCollection 2024 Dec.

Abstract

Although over 300 million protein sequences are registered in a reference sequence database, only 0.2% have experimentally determined functions. This suggests that many valuable proteins, potentially catalyzing novel enzymatic reactions, remain undiscovered among the vast number of proteins of unknown function. In this study, we developed a method to predict whether two proteins catalyze the same enzymatic reaction by analyzing their sequence and structural similarities, using structural models predicted by AlphaFold2. We performed pocket detection and domain decomposition for each structural model. The similarity between protein pairs was assessed using features such as full-length sequence similarity, domain structural similarity, and pocket similarity. We developed several models using conventional machine learning algorithms and found that the LightGBM-based model outperformed the other models. Our method also surpassed existing approaches, including those based solely on full-length sequence similarity and state-of-the-art deep learning models. Feature importance analysis revealed that domain sequence identity, calculated through structural alignment, had the greatest influence on the prediction. These findings demonstrate that integrating sequence and structural information improves the accuracy of protein function prediction.
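As a rough illustration of the kind of pairwise feature the abstract describes, the sketch below computes sequence identity from two already-aligned sequences. It is not the authors' implementation. The function name `alignment_identity`, the gap symbol, and the choice of denominator (columns where both sequences have a residue) are assumptions; the abstract does not state how identity was normalized. The aligned strings are toy data standing in for the output of a structural alignment tool.

```python
def alignment_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of aligned (non-gap) columns in which the residues match.

    Assumption: identity is normalized by the number of columns where
    both sequences have a residue; other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    aligned = 0  # columns where neither sequence has a gap
    matches = 0  # aligned columns with identical residues
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if res_a.upper() == res_b.upper():
            matches += 1

    # Return 0.0 when no residues are aligned to avoid division by zero.
    return matches / aligned if aligned else 0.0


# Toy alignment (hypothetical data): 7 aligned columns, 6 identical.
identity = alignment_identity("MKT-AYIAK", "MKTLAYLA-")
print(f"domain sequence identity: {identity:.3f}")
```

In a pipeline like the one summarized above, a value such as this would be one entry in a per-pair feature vector, alongside full-length sequence similarity and pocket similarity, that is passed to a classifier.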

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e33/11609699/915cb6b33b5b/ga1.jpg
