Indiana University, 150 S. Woodlawn Ave, Bloomington, IN 47405, United States.
Methods. 2017 Oct 1;129:8-17. doi: 10.1016/j.ymeth.2017.04.018. Epub 2017 Apr 26.
Recent years have witnessed an unprecedented accumulation of DNA sequences, and therefore of protein sequences predicted from them, owing to advances in sequencing technology. Metagenomics research is one of the major sources of hypothetical proteins. Current annotation of metagenomes (collections of short metagenomic sequences or assemblies) relies on similarity searches against known gene/protein families, from which functional profiles of microbial communities can be built. This practice, however, leaves out the hypothetical proteins, which may outnumber the known proteins in many microbial communities. Conversely, we may ask: what can we gain from the large number of metagenomes made available by metagenomic studies, both for the annotation of metagenomic sequences and for the functional annotation of hypothetical proteins in general? Here we propose a community profiling approach for predicting functional associations between proteins: two proteins are predicted to be associated if they share similar presence and absence profiles (called community profiles) across microbial communities. Community profiling is conceptually similar to the phylogenetic profiling approach to functional prediction, but with fundamental differences. We tested different profile construction methods, selections of reference metagenomes, and correlation metrics, among other factors, to optimize the performance of this new approach. We demonstrated that community profiling alone slightly outperforms phylogenetic profiling for associating proteins in species that are well represented by sequenced genomes, and that combining phylogenetic and community profiling further improves the prediction of functional associations, though only marginally. Furthermore, we showed that the community profiling method significantly outperforms phylogenetic profiling, revealing more functional associations, when applied to a more recently sequenced bacterial genome.
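The core idea of community profiling, as described above, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which profile construction method or correlation metric performed best, so the sketch assumes binary presence/absence profiles and Pearson correlation. The protein names, the toy profiles, the `predict_associations` helper, and the 0.8 threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a protein present (or absent) everywhere carries no signal
    return cov / sqrt(var_x * var_y)


def predict_associations(profiles, threshold=0.8):
    """Return (protein_a, protein_b, correlation) for pairs at or above the threshold.

    `profiles` maps a protein name to its community profile: one entry per
    reference metagenome, 1 if the protein is present there and 0 if absent.
    The threshold is an illustrative assumption, not a published value.
    """
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        r = pearson(prof_a, prof_b)
        if r >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical toy data: four proteins profiled across six metagenomes.
profiles = {
    "protA": [1, 1, 0, 0, 1, 0],
    "protB": [1, 1, 0, 0, 1, 0],  # same pattern as protA
    "protC": [0, 0, 1, 1, 0, 1],  # opposite pattern to protA
    "protD": [1, 0, 1, 0, 1, 1],  # unrelated pattern
}

associations = predict_associations(profiles)
for name_a, name_b, r in associations:
    print(f"{name_a} - {name_b}: r = {r:.2f}")
```

In this toy example only protA and protB are reported as associated, because they co-occur in exactly the same communities. A phylogenetic profile would have the same shape, with sequenced genomes in place of metagenomes as the columns.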