eggNOG: automated construction and annotation of orthologous groups of genes.
Author information
Jensen Lars Juhl, Julien Philippe, Kuhn Michael, von Mering Christian, Muller Jean, Doerks Tobias, Bork Peer
Affiliation
European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
Publication information
Nucleic Acids Res. 2008 Jan;36(Database issue):D250-4. doi: 10.1093/nar/gkm796. Epub 2007 Oct 16.
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database ('evolutionary genealogy of genes: Non-supervised Orthologous Groups'), which contains orthologous groups constructed from Smith-Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
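The clustering idea named in the abstract (reciprocal best matches followed by triangular linkage) can be illustrated with a minimal Python sketch. This is not the eggNOG pipeline itself. It assumes pairwise alignment scores are already available as a dictionary, that each gene is a `(genome, name)` tuple, and that triangles of reciprocal best matches are merged whenever they share an edge, as in the classic COG procedure. The function names `reciprocal_best_hits` and `triangle_clusters` are hypothetical, and the sketch ignores in-paralogs, score thresholds, and tie handling.

```python
from itertools import combinations


def reciprocal_best_hits(scores):
    """Return the set of reciprocal best-match gene pairs.

    scores maps (query_gene, target_gene) -> alignment score, where each
    gene is a (genome, name) tuple. Both directions must be present.
    """
    best = {}  # (query gene, target genome) -> (score, best target gene)
    for (query, target), score in scores.items():
        if query[0] == target[0]:
            continue  # skip within-genome comparisons
        key = (query, target[0])
        if key not in best or score > best[key][0]:
            best[key] = (score, target)

    pairs = set()
    for (query, _), (_, target) in best.items():
        back = best.get((target, query[0]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, target)))
    return pairs


def triangle_clusters(pairs):
    """Merge triangles of reciprocal best matches that share an edge."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # A triangle is three genes, pairwise linked by reciprocal best matches.
    triangles = set()
    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    triangles = sorted(triangles, key=sorted)

    # Union-find over triangles; two triangles join if they share an edge.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for i, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            j = edge_owner.setdefault(edge, i)
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())
```

In this sketch, a reciprocal best-match pair that never closes a triangle with a gene from a third genome is left out of every group, which is the conservative behaviour of triangle-based linkage.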