Ye Yuzhen, Doak Thomas G
School of Informatics, Indiana University, Bloomington, IN, USA.
PLoS Comput Biol. 2009 Aug;5(8):e1000465. doi: 10.1371/journal.pcbi.1000465. Epub 2009 Aug 14.
A common approach to biological pathway reconstruction, implemented by many automatic pathway services (such as the KAAS and RAST servers) and used in the functional annotation of metagenomic sequences, starts by identifying protein functions or families in the query sequences (e.g., KO families for the KEGG database and FIG families for the SEED database) and then maps the identified protein families directly onto pathways. Given the resulting patchwork of predicted individual biochemical steps, some metric must be applied to decide which pathways actually exist in the genome or metagenome represented by the sequences. The common and straightforward choice is to call a complete biological pathway present in a dataset if at least one of the steps associated with that pathway is found. We report, however, that this naïve mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstruction from protein family predictions, which yields a more conservative, yet more faithful, estimate of the biological pathways for a query dataset. Compared to the naïve mapping approach, MinPath identified far fewer pathways for the genomes collected in the KEGG database, eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
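The contrast between naïve mapping and a parsimony criterion can be illustrated on a toy example. The sketch below is not the MinPath implementation: the abstract does not spell out the algorithm, so the code assumes that "minimal set of pathways" means the smallest set of pathways that together explain every observed protein family, and it finds that set by brute-force search. The pathway names, family names, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy mapping: pathway -> protein families that take part in it.
PATHWAY_FAMILIES = {
    "P1": {"F1", "F2", "F3"},
    "P2": {"F3", "F4"},
    "P3": {"F2", "F5"},
    "P4": {"F4", "F6"},
}


def naive_pathways(observed, pathway_families):
    """Naive mapping: a pathway is called present if any of its families is observed."""
    return {p for p, fams in pathway_families.items() if fams & observed}


def minimal_pathways(observed, pathway_families):
    """Parsimony sketch (assumption): smallest pathway set explaining all observed families.

    Brute force over subsets by increasing size, so it only suits toy inputs.
    Ties are broken by sorted pathway name.
    """
    candidates = sorted(naive_pathways(observed, pathway_families))
    # Observed families that belong to at least one pathway and so can be explained.
    coverable = set().union(*(pathway_families[p] for p in candidates)) & observed
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set()
            for p in subset:
                covered |= pathway_families[p] & observed
            if covered == coverable:
                return set(subset)
    return set()


observed_families = {"F1", "F2", "F3", "F4"}
naive = naive_pathways(observed_families, PATHWAY_FAMILIES)
minimal = minimal_pathways(observed_families, PATHWAY_FAMILIES)
print(sorted(naive))    # all four pathways share at least one observed family
print(sorted(minimal))  # two pathways already explain every observed family
```

In this toy case the naïve rule reports P3 and P4 even though each is supported only by a family that P1 or P2 already accounts for, which is the kind of inflation the abstract describes. Exhaustive search grows exponentially with the number of candidate pathways, so a real dataset would need an optimization solver or a heuristic instead.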