School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK.
BMC Bioinformatics. 2012 Jun 7;13:127. doi: 10.1186/1471-2105-13-127.
Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language.
Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents differed consistently between annotations captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora by GO evidence code, we found that the measured power law exponent varied predictably with the evidence codes used to support the annotation.
Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show statistical behaviours similar to those seen in natural language. The measured exponents correlate with the nature of the evidence codes used to support the annotations, suggesting that the exponent might provide a signal regarding the information content of the annotation.
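As an illustration of the kind of measurement described above, the following minimal sketch ranks terms by frequency and estimates the power law exponent from the slope of log(frequency) against log(rank). The abstract does not state which fitting method was used, so the least-squares fit here is an assumption chosen for simplicity; it is known to be a rough estimator, and maximum-likelihood approaches are often preferred. The function names `zipf_exponent` and `filter_by_evidence`, the `term_N` labels and the counts are all hypothetical and are not real annotation data. `IDA` and `IEA` are standard GO evidence codes.

```python
import math
from collections import Counter


def zipf_exponent(terms):
    """Estimate the Zipf exponent by least squares on log(rank) vs log(frequency).

    Zipf's law says frequency is proportional to rank ** (-alpha); this
    returns the fitted alpha. The fitting method is an assumption of this
    sketch, not the method reported in the paper.
    """
    counts = sorted(Counter(terms).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct terms to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -(sxy / sxx)  # the slope is negative, the exponent positive


def filter_by_evidence(annotations, allowed_codes):
    """Keep the term of each (term, evidence_code) pair whose code is allowed."""
    return [term for term, code in annotations if code in allowed_codes]


# Toy corpus (not real data): the term at rank k occurs 120 / k times,
# which is an exact Zipf distribution with exponent 1.
annotations = []
for rank, count in enumerate([120, 60, 40, 30, 24], start=1):
    annotations += [(f"term_{rank}", "IDA")] * count

# Electronically inferred annotations that the filter should exclude.
annotations += [("term_1", "IEA")] * 500

experimental = filter_by_evidence(annotations, {"IDA"})
alpha = zipf_exponent(experimental)
print(f"estimated exponent: {alpha:.3f}")
```

Because the toy counts were constructed to follow an exact 1/rank distribution, the fit on the filtered corpus should recover an exponent of approximately 1. Changing the set of allowed evidence codes changes the term frequencies and therefore the fitted exponent, which is the kind of comparison the results describe.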