Ogren P V, Cohen K B, Acquaah-Mensah G K, Eberlein J, Hunter L
University of Colorado at Boulder, Dept. of Computer Science, Boulder, CO, USA.
Pac Symp Biocomput. 2004:214-25. doi: 10.1142/9789812704856_0021.
An analysis of the term names in the Gene Ontology (GO) reveals the prevalence of substring relations between terms: 65.3% of all GO terms contain another GO term as a proper substring. This substring relation often coincides with a derivational relationship between the terms. For example, the term "regulation of cell proliferation" (GO:0042127) is derived from the term "cell proliferation" (GO:0008283) by addition of the phrase "regulation of". Further, we note that particular substrings which are not themselves GO terms (e.g. "regulation of" in the preceding example) recur frequently and in consistent subtrees of the ontology, and that these frequently occurring substrings often indicate interesting semantic relationships between the related terms. We describe the extent of two phenomena (substring relations between terms, and the recurrence of derivational phrases such as "regulation of") and propose that they can be exploited in various ways: to make the information in GO more computationally accessible, to construct a conceptually richer representation of the data encoded in the ontology, and to assist in the analysis of natural language texts.
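The two measurements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the plain character-level substring test, and the way the leftover phrase is extracted are all assumptions made for illustration. Only the two GO identifiers and names quoted in the abstract are real; the `EX:` entries are hypothetical placeholders.

```python
from collections import Counter

# Toy term table. The two GO entries come from the abstract;
# the EX: entries are hypothetical placeholders, not real GO terms.
TERMS = {
    "GO:0008283": "cell proliferation",
    "GO:0042127": "regulation of cell proliferation",
    "EX:0000001": "regulation of example process",
    "EX:0000002": "example process",
    "EX:0000003": "unrelated activity",
}


def find_contained_terms(terms):
    """Map each term ID to the IDs of other terms whose name is a
    proper substring of its name (assumed: plain character matching)."""
    contained = {}
    for term_id, name in terms.items():
        hits = sorted(
            other_id
            for other_id, other_name in terms.items()
            if other_id != term_id
            and len(other_name) < len(name)   # proper: strictly shorter
            and other_name in name
        )
        if hits:
            contained[term_id] = hits
    return contained


def fraction_with_substring(terms):
    """Fraction of terms that contain at least one other term name."""
    return len(find_contained_terms(terms)) / len(terms)


def residual_phrases(terms):
    """Count the phrases left over after removing a contained term name,
    e.g. 'regulation of' from 'regulation of cell proliferation'."""
    counts = Counter()
    for term_id, others in find_contained_terms(terms).items():
        for other_id in others:
            residue = terms[term_id].replace(terms[other_id], " ", 1)
            residue = " ".join(residue.split())  # normalize whitespace
            if residue:
                counts[residue] += 1
    return counts
```

On the toy table, `fraction_with_substring(TERMS)` plays the role of the 65.3% statistic, and `residual_phrases(TERMS)` surfaces recurring derivational phrases such as "regulation of". A character-level test can match inside a longer word, so matching on token boundaries may be preferable on the full ontology.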