How much metagenome data is needed for protein structure prediction: The advantages of targeted approach from the ecological and evolutionary perspectives.

Authors

Yang Pengshuo, Ning Kang

Affiliations

Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology, Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China.

Publication information

Imeta. 2022 Mar 6;1(1):e9. doi: 10.1002/imt2.9. eCollection 2022 Mar.

Abstract

It has been shown that three-dimensional protein structures can be modeled by supplementing homologous sequences with metagenome sequences. Although a large volume of metagenome data has been used for this purpose, a significant proportion of protein structures remain unsolved. In this review, we focus on identifying ecological and evolutionary patterns in metagenome data, decoding the complicated relationships between these patterns and protein structures, and investigating how these patterns can be used to improve protein structure prediction. First, we propose the metagenome utilization efficiency and marginal effect model to quantify the divergent distribution of homologous sequences for a protein family. Second, we propose that the targeted approach identifies homologous sequences from specified biomes more effectively than the blind search of the untargeted approach. Finally, we determine the lower bound on the metagenome data required to predict all the protein structures in the Pfam database and show that the present metagenome data are insufficient for this purpose. In summary, we identify ecological and evolutionary patterns in metagenome data that may be used to predict protein structures effectively. The targeted approach is promising for extracting homologous sequences and predicting protein structures using these patterns.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3d1/10989767/0ef7e8046ae0/IMT2-1-e9-g004.jpg
