Galperin Michael Y, Makarova Kira S, Wolf Yuri I, Koonin Eugene V
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Nucleic Acids Res. 2015 Jan;43(Database issue):D261-9. doi: 10.1093/nar/gku1223. Epub 2014 Nov 26.
Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) its orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and to describe the range of potential functions when there was more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, together with a comprehensive revision of the COG annotations and an expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, the functions of many previously uncharacterized COGs have been elucidated and the tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins, many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics.
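The orthology-based approach in point (ii) can be illustrated with a minimal sketch. It is not the COG pipeline itself: the function name `transfer_annotations`, the COG identifiers, the protein identifiers and the function labels below are all hypothetical, and the real database also relies on manual curation that this code does not model.

```python
def transfer_annotations(cog_members, known_functions):
    """Give every member of a COG the functions of its characterized members.

    cog_members: dict mapping a COG id to a list of protein ids (hypothetical data).
    known_functions: dict mapping a protein id to an experimentally known function.
    Returns a dict mapping each protein id to a sorted list of candidate functions.
    """
    annotations = {}
    for cog_id, members in cog_members.items():
        # Distinct functions of the characterized members of this COG.
        functions = sorted(
            {known_functions[m] for m in members if m in known_functions}
        )
        for member in members:
            # More than one entry means a range of potential functions;
            # an empty list means the COG is still uncharacterized.
            annotations[member] = list(functions)
    return annotations


# Hypothetical example data, invented for illustration only.
cog_members = {
    "COG_A": ["p1", "p2", "p3"],
    "COG_B": ["p4", "p5"],
    "COG_C": ["p6"],
}
known_functions = {
    "p1": "tRNA modification",
    "p4": "kinase",
    "p5": "phosphatase",
}

annotations = transfer_annotations(cog_members, known_functions)
```

In this toy input, the uncharacterized proteins `p2` and `p3` inherit the function of `p1`, both members of `COG_B` receive the two candidate functions found in that group, and `p6` stays unannotated because no member of `COG_C` has been characterized.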