ETH Zurich, Computer Science, Zurich, Switzerland.
PLoS Comput Biol. 2013;9(1):e1002852. doi: 10.1371/journal.pcbi.1002852. Epub 2013 Jan 3.
New microbial genomes are being sequenced at a rapid pace, allowing insight into the genetics not only of cultured microbes but also of a wide range of metagenomic collections such as the human microbiome. To make sense of this deluge of genomic data, computational approaches to gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs (homologs separated by a speciation and a duplication event, respectively) provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes to which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, in both cost and time. Our predictions for 998 prokaryotic genomes include ~400,000 specific annotations at an estimated Precision of 90%, ~19,000 of which are highly specific (e.g. "penicillin binding," "tRNA aminoacylation for protein translation," or "pathogenesis"), and are freely available at http://gorbi.irb.hr/.
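The agreement between the validation outcome (25 of 38 predictions confirmed) and the reported Precision of 60% can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the Wilson score interval is a standard textbook choice made here for the example, not a method the abstract says the authors used.

```python
import math


def observed_precision(confirmed: int, total: int) -> float:
    """Fraction of tested predictions that were experimentally confirmed."""
    return confirmed / total


def wilson_interval(confirmed: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumption: each validated prediction is treated as an independent
    Bernoulli trial, which the abstract does not state explicitly.
    """
    p = confirmed / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return centre - half_width, centre + half_width


# Figures taken from the abstract: 25 confirmed out of 38 tested,
# with a reported (model-estimated) Precision of 60%.
confirmed, total, reported = 25, 38, 0.60

precision = observed_precision(confirmed, total)  # about 0.66
low, high = wilson_interval(confirmed, total)

print(f"observed precision: {precision:.3f}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
print("reported Precision inside interval:", low <= reported <= high)
```

With only 38 tested genes the interval is wide, so under these assumptions the check shows that the observed rate is compatible with the 60% estimate rather than pinning it down precisely.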