Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109.
Proc Natl Acad Sci U S A. 2021 Dec 7;118(49). doi: 10.1073/pnas.2110828118.
Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep learning-based modeling studies rely on blind sequence database searches and suffer from low efficiency in both computational resource use and model construction, especially when the sequence library becomes prohibitively large. We propose MetaSource, a model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor), to decode the inherent linkage between microbial niches and homologous protein families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires less than one-third of the computer memory and CPU (central processing unit) time, yet generates contact maps and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between rapidly growing metagenome databases and the limited computing resources available for efficient genome-wide database mining, and they provide a useful blueprint to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.