Arakaki Adrian K, Tian Weidong, Skolnick Jeffrey
Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA.
BMC Genomics. 2006 Dec 13;7:315. doi: 10.1186/1471-2164-7-315.
The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.
Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added to KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes whose biochemical functions have recently been characterized, and find that we correctly identified their three-field EC numbers in 96% of the cases and their four-field EC numbers in 84% of the cases. For two of these 64 hypothetical proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 from Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. We present two examples in which EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).
Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.
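The distinction between three-field and four-field EC number agreement used in the results can be illustrated with a minimal Python sketch. It is not part of EFICAz. The function names (`ec_fields`, `ec_match`, `agreement`) and the prediction/reference pairs are hypothetical and chosen only for illustration. The EC numbers are borrowed from the text, but the mismatching pairs are invented and do not reproduce any result of the study.

```python
def ec_fields(ec, n):
    """Return the first n fields of a four-field EC number as a tuple of strings."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-field EC number, got {ec!r}")
    if not 1 <= n <= 4:
        raise ValueError("n must be between 1 and 4")
    return tuple(parts[:n])


def ec_match(predicted, reference, n):
    """True if the two EC numbers agree on their first n fields."""
    return ec_fields(predicted, n) == ec_fields(reference, n)


def agreement(pairs, n):
    """Fraction of (predicted, reference) pairs that agree at the n-field level."""
    if not pairs:
        raise ValueError("no pairs to compare")
    hits = sum(ec_match(pred, ref, n) for pred, ref in pairs)
    return hits / len(pairs)


# Hypothetical (predicted, reference) pairs, for illustration only.
toy_pairs = [
    ("4.2.2.3", "4.2.2.3"),    # agrees on all four fields
    ("3.6.1.13", "3.6.1.13"),  # agrees on all four fields
    ("3.6.1.10", "3.6.1.11"),  # agrees on three fields, differs in the fourth
    ("3.1.4.12", "3.1.3.12"),  # differs already in the third field
]

three_field = agreement(toy_pairs, 3)
four_field = agreement(toy_pairs, 4)
print(f"three-field agreement: {three_field:.2f}")
print(f"four-field agreement:  {four_field:.2f}")
```

Because a four-field match implies a three-field match, the three-field agreement can never be lower than the four-field agreement, which is consistent with the ordering of the 96% and 84% figures reported above.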