Makarova Kira S, Sorokin Alexander V, Novichkov Pavel S, Wolf Yuri I, Koonin Eugene V
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Biol Direct. 2007 Nov 27;2:33. doi: 10.1186/1745-6150-2-33.
An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One practical strategy is to construct refined COGs for phylogenetically compact subsets of genomes.
New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus among three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover approximately 88% of the genes in a genome, compared with approximately 76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; approximately 40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and the gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to have possessed 996 genes, compared with 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems.
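As an illustration only, the "gene core" definition and the cluster counts quoted above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' pipeline: the function `core_clusters`, the toy genome labels, and the toy cluster IDs are hypothetical, and the calculation of the "new" fraction assumes that "new" means arCOGs without a corresponding COG.

```python
from typing import Dict, Set


def core_clusters(patterns: Dict[str, Set[str]], genomes: Set[str]) -> Set[str]:
    """Return the IDs of clusters represented in every genome.

    `patterns` maps a cluster ID to the set of genomes that contain
    at least one member of that cluster (a phyletic pattern).
    """
    return {cid for cid, present in patterns.items() if genomes <= present}


# Toy phyletic patterns (hypothetical IDs, not real arCOG data).
genomes = {"G1", "G2", "G3"}
patterns = {
    "cluster_A": {"G1", "G2", "G3"},  # present everywhere: core
    "cluster_B": {"G1", "G3"},        # missing from G2: not core
    "cluster_C": {"G1", "G2", "G3"},  # present everywhere: core
}
core = core_clusters(patterns, genomes)

# Counts taken from the text above.
total_arcogs = 7538
arcogs_with_cog = 4538
matched_cogs = 2362

# Assumption: "new" arCOGs are those with no corresponding COG.
new_fraction = (total_arcogs - arcogs_with_cog) / total_arcogs

# Average number of arCOGs per matched COG (granularity ratio).
split_ratio = arcogs_with_cog / matched_cogs

print(sorted(core))
print(f"new fraction: {new_fraction:.3f}")
print(f"arCOGs per matched COG: {split_ratio:.2f}")
```

Under that assumption, 3000 of the 7538 arCOGs lack a COG counterpart, which matches the stated figure of approximately 40%, and each matched COG corresponds on average to roughly two arCOGs.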
The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.