Natale D A, Shankavaram U T, Galperin M Y, Wolf Y I, Aravind L, Koonin E V
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Rockville Pike, Bethesda, MD 20894, USA.
Genome Biol. 2000;1(5):RESEARCH0009. doi: 10.1186/gb-2000-1-5-research0009. Epub 2000 Nov 6.
Standard archival sequence databases have not been designed as tools for genome annotation and are far from optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea: Aeropyrum pernix, the first member of the Crenarchaeota to be sequenced, and Pyrococcus abyssi.
A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 A. pernix proteins that could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase over the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although it was also observed to group with the two species of Pyrococci, which indicates similar repertoires of conserved genes. These analyses gave no indication of a specific relationship between Crenarchaeota and eukaryotes. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix.
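As an illustration of how a distance between genomes can be derived from their co-occurrence in COGs, the sketch below computes pairwise Jaccard distances between genome COG profiles. The abstract does not state which distance measure was used, so the Jaccard distance is an assumption made for illustration, and the COG identifiers and memberships in `cog_members` are invented toy data, not results from the study.

```python
from itertools import combinations

# Hypothetical toy data: COG identifier -> set of genomes represented in it.
# Identifiers and memberships are invented for illustration only.
cog_members = {
    "COG_A": {"A_pernix", "P_abyssi", "P_horikoshii", "M_jannaschii"},
    "COG_B": {"A_pernix", "P_abyssi", "P_horikoshii"},
    "COG_C": {"P_abyssi", "P_horikoshii", "M_jannaschii"},
    "COG_D": {"A_pernix"},
    "COG_E": {"M_jannaschii"},
}


def genome_profiles(cog_members):
    """Invert the mapping: genome -> set of COGs in which it is represented."""
    profiles = {}
    for cog, genomes in cog_members.items():
        for genome in genomes:
            profiles.setdefault(genome, set()).add(cog)
    return profiles


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|, or 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(cog_members):
    """Pairwise distances keyed by (genome1, genome2), with names in sorted order."""
    profiles = genome_profiles(cog_members)
    return {
        (g1, g2): jaccard_distance(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


distances = distance_matrix(cog_members)
```

A matrix of this kind could then be passed to any standard clustering or tree-building routine; genomes with similar repertoires of conserved genes end up close together regardless of their phylogenetic relationship.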
Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.
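The differential genome display idea can be sketched as a filter over phyletic patterns: select the COGs that contain every genome in one set and none in another. The function name `cogs_with_without`, the genome labels, and the toy memberships below are hypothetical and serve only to illustrate the kind of query involved; they are not the authors' implementation or data.

```python
# Hypothetical toy data: COG identifier -> set of genomes represented in it.
# Identifiers and memberships are invented for illustration only.
cog_members = {
    "COG_A": {"A_pernix", "P_abyssi", "M_jannaschii", "E_coli"},
    "COG_B": {"A_pernix", "P_abyssi"},
    "COG_C": {"P_abyssi", "M_jannaschii", "E_coli"},
    "COG_D": {"A_pernix", "E_coli"},
    "COG_E": {"M_jannaschii"},
}

# Assumed grouping of the toy genomes into euryarchaea.
euryarchaea = {"P_abyssi", "M_jannaschii"}


def cogs_with_without(cog_members, present, absent):
    """Return sorted COG ids that contain every genome in `present`
    and no genome in `absent`."""
    present, absent = set(present), set(absent)
    return sorted(
        cog
        for cog, genomes in cog_members.items()
        if present <= genomes and not (absent & genomes)
    )


# COGs that include A. pernix but no euryarchaeal member.
crenarchaeal_only = cogs_with_without(cog_members, {"A_pernix"}, euryarchaea)

# COGs conserved in all listed euryarchaea but missing in A. pernix.
missing_in_a_pernix = cogs_with_without(cog_members, euryarchaea, {"A_pernix"})
```

In this toy example the first query returns the COG shared only with the bacterial genome, which is the kind of pattern that could point to a shared function or to lateral gene exchange, and the second returns a COG whose absence from A. pernix would be unexpected.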