Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction.

Affiliations

Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109.

Publication information

Proc Natl Acad Sci U S A. 2021 Dec 7;118(49). doi: 10.1073/pnas.2110828118.

DOI:10.1073/pnas.2110828118

PMID:34873061

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8670487/

Abstract

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep-learning-based modeling studies rely on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires more than threefold less computer memory and CPU (central processing unit) time yet generates contact-map and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly growing metagenome databases and the limited computing resources available for efficient genome-wide database mining, and they provide a useful blueprint to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.
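
The contrast between a biome-targeted search and a combined search can be illustrated with a toy sketch. This is not the MetaSource model: the biome affinity scores, the sequences, and the function names below are all hypothetical, and a simple per-position identity filter stands in for a real homology search. It only shows why routing a query to one biome reduces the number of sequences scanned.

```python
# Toy illustration only: all data, scores, and names are hypothetical.

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def collect_homologs(query: str, sequences: list[str], min_identity: float = 0.6) -> list[str]:
    """Stand-in for a homology search: keep sequences above an identity cutoff."""
    return [s for s in sequences if identity(query, s) >= min_identity]

def targeted_search(query, biome_dbs, biome_scores, min_identity=0.6):
    """Search only the biome with the highest (assumed) affinity score."""
    best = max(biome_scores, key=biome_scores.get)
    hits = collect_homologs(query, biome_dbs[best], min_identity)
    return best, hits, len(biome_dbs[best])

def combined_search(query, biome_dbs, min_identity=0.6):
    """Search the union of all biome collections."""
    all_seqs = [s for seqs in biome_dbs.values() for s in seqs]
    return collect_homologs(query, all_seqs, min_identity), len(all_seqs)

# Hypothetical miniature sequence collections for the four biomes.
biome_dbs = {
    "Gut": ["ACDEFG", "ACDEFA", "ACDKFG"],
    "Lake": ["WWWWWW", "YYYYYY"],
    "Soil": ["ACDWWW", "PPPPPP"],
    "Fermentor": ["HHHHHH"],
}
# Hypothetical affinity of the query's family for each biome.
biome_scores = {"Gut": 0.9, "Lake": 0.1, "Soil": 0.3, "Fermentor": 0.05}
query = "ACDEFG"

best, hits, scanned = targeted_search(query, biome_dbs, biome_scores)
all_hits, all_scanned = combined_search(query, biome_dbs)
print(f"targeted ({best}): {len(hits)} hits from {scanned} sequences scanned")
print(f"combined: {len(all_hits)} hits from {all_scanned} sequences scanned")
```

In this toy data the targeted search recovers the same hits while scanning fewer sequences. How the affinity scores would be obtained in practice is exactly what a model such as MetaSource is meant to supply, and is not shown here.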
