Kourmpetis Yiannis A I, van der Burgt Ate, Bink Marco C A M, Ter Braak Cajo J F, van Ham Roeland C H J
Biometris, Wageningen University and Research Centre, 6700 AC Wageningen, The Netherlands.
In Silico Biol. 2007;7(6):575-82.
The Gene Ontology (GO) is a widely used controlled vocabulary for describing gene function. In this study we quantify the usage of multiple, hierarchically independent GO terms in the curated genome annotations of seven well-studied species. In most genomes, significant proportions (6-60%) of genes have been annotated with multiple, hierarchically independent terms. This may be necessary to attain adequate specificity of description. One notable exception is Arabidopsis thaliana, in which genes are much less frequently annotated with multiple terms (6-14%). In contrast, an analysis of the occurrence of InterPro hits in the proteomes of the seven species, followed by a mapping of the hits to GO terms, did not reveal an aberrant pattern for the A. thaliana genome. This study shows the widespread usage of multiple, hierarchically independent GO terms in the functional annotation of genes. Consequently, probabilistic methods that aim to predict gene function automatically by integrating diverse genomic datasets, and that employ the GO, must be able to predict such multiple terms. We attribute the low frequency with which multiple GO terms are used in Arabidopsis to differing genome annotation and curation practices among communities of annotators. This may bias genome-scale comparisons of gene function between species. GO term assignment should therefore be performed according to strictly consistent rules and standards.
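The quantity measured in the abstract can be illustrated with a minimal sketch. It assumes that a gene counts as annotated with multiple hierarchically independent terms when at least two of its terms remain after discarding every term that is an ancestor of another term assigned to the same gene; the study's exact criterion may differ. The term names, the toy hierarchy, and the function names are hypothetical and are not taken from the paper or from the real GO.

```python
from typing import Dict, Iterable, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` (excluding the term itself) in the DAG."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def independent_terms(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Keep only terms that are not an ancestor of another term in the set."""
    term_set = set(terms)
    implied: Set[str] = set()
    for term in term_set:
        implied |= ancestors(term, parents)
    return term_set - implied


def fraction_multi_independent(
    annotations: Dict[str, Set[str]], parents: Dict[str, Set[str]]
) -> float:
    """Fraction of genes with two or more hierarchically independent terms."""
    if not annotations:
        return 0.0
    multi = sum(
        1 for terms in annotations.values()
        if len(independent_terms(terms, parents)) >= 2
    )
    return multi / len(annotations)


# Hypothetical toy hierarchy: child term -> set of direct parent terms.
toy_parents = {
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
}

# Hypothetical annotations: gene -> set of assigned terms.
toy_annotations = {
    "gene1": {"binding", "dna_binding"},        # one term implies the other
    "gene2": {"dna_binding", "catalysis"},      # two independent terms
    "gene3": {"catalysis"},                     # single term
    "gene4": {"root", "binding", "catalysis"},  # root is implied by both
}

print(fraction_multi_independent(toy_annotations, toy_parents))
```

In this toy data, gene2 and gene4 each keep two independent terms, so half of the four genes qualify. With real data, the parent map would come from the ontology file and the annotations from a curated annotation file for each species, computed separately per ontology.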