Tatusov Roman L, Fedorova Natalie D, Jackson John D, Jacobs Aviva R, Kiryutin Boris, Koonin Eugene V, Krylov Dmitri M, Mazumder Raja, Mekhedov Sergei L, Nikolskaya Anastasia N, Rao B Sridhar, Smirnov Sergei, Sverdlov Alexander V, Vasudevan Sona, Wolf Yuri I, Yin Jodie J, Natale Darren A
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
BMC Bioinformatics. 2003 Sep 11;4:41. doi: 10.1186/1471-2105-4-41.
The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.
We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes, together with the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs (for eukaryotic orthologous groups). The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The KOGs include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant (Arabidopsis thaliana), two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or approximately 54% of the 110,655 analyzed eukaryotic gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs (the patterns of presence and absence of each KOG across the analyzed species) reveals a conserved core that is represented in all analyzed species and consists of approximately 20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (approximately 1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.
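The coverage figures and the notion of a conserved core can be illustrated with a short sketch. This is not the authors' pipeline: the protein counts are the ones quoted above, but the function names, the toy genome codes, and the toy cluster identifiers are hypothetical and chosen only for illustration. A cluster is counted as "core" here when it has at least one member in every genome under consideration.

```python
from collections import defaultdict

# Protein counts quoted in the abstract.
COG_PROTEINS_IN_CLUSTERS, COG_PROTEINS_TOTAL = 138_458, 185_505
KOG_PROTEINS_IN_CLUSTERS, KOG_PROTEINS_TOTAL = 59_838, 110_655


def coverage(in_clusters, total):
    """Fraction of predicted proteins that belong to some cluster."""
    return in_clusters / total


def phyletic_patterns(memberships):
    """Map each cluster id to the set of genomes with at least one member.

    `memberships` is an iterable of (cluster_id, genome) pairs, one per protein.
    """
    patterns = defaultdict(set)
    for cluster_id, genome in memberships:
        patterns[cluster_id].add(genome)
    return dict(patterns)


def core_fraction(patterns, genomes):
    """Fraction of clusters represented in every genome of `genomes`."""
    if not patterns:
        return 0.0
    all_genomes = set(genomes)
    core = [c for c, present in patterns.items() if present >= all_genomes]
    return len(core) / len(patterns)


# Hypothetical toy data: four genomes, three clusters, one of them universal.
TOY_GENOMES = ["g1", "g2", "g3", "g4"]
TOY_MEMBERSHIPS = [
    ("toyA", "g1"), ("toyA", "g2"), ("toyA", "g3"), ("toyA", "g4"),
    ("toyB", "g1"), ("toyB", "g2"), ("toyB", "g2"), ("toyB", "g3"),
    ("toyC", "g2"), ("toyC", "g3"), ("toyC", "g4"),
]

if __name__ == "__main__":
    print(f"COG coverage: {coverage(COG_PROTEINS_IN_CLUSTERS, COG_PROTEINS_TOTAL):.1%}")
    print(f"KOG coverage: {coverage(KOG_PROTEINS_IN_CLUSTERS, KOG_PROTEINS_TOTAL):.1%}")
    toy_patterns = phyletic_patterns(TOY_MEMBERSHIPS)
    print(f"Toy core fraction: {core_fraction(toy_patterns, TOY_GENOMES):.2f}")
```

Applied to real cluster membership tables, the same calculation would be expected to give a core fraction near 20% for the KOGs and near 1% for the COGs, as reported above; the toy data only demonstrate the mechanics.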
The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.