Hatzivassiloglou Vasileios, Weng Wubin
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027 USA.
Int J Med Inform. 2002 Dec 4;67(1-3):19-32. doi: 10.1016/s1386-5056(02)00054-0.
Much knowledge modeling in molecular biology concerns interactions between proteins, genes, various forms of RNA, small molecules, and similar substances. These interactions are typically extracted and codified manually, which increases the cost and time of modeling and substantially limits the coverage of the resulting knowledge base. In this paper, we describe an automatic system that learns interaction verbs from text; these verbs can then form the core of automatically retrieved patterns that model classes of biological interactions. We investigate text features relating verbs to genes and proteins, and apply statistical tests and a logistic regression model to determine whether a given verb belongs to the class of interaction verbs. Our system, AVAD, achieves over 87% precision and 82% recall when tested on an 11-million-word corpus of journal articles. In addition, we compare the automatically obtained results with a manually constructed database of interaction verbs and show that the automatic approach can significantly enrich the manual list by detecting rarer interaction verbs that were omitted from the database.
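The classification step named in the abstract (a logistic regression deciding whether a verb is an interaction verb, evaluated by precision and recall) can be illustrated with a minimal sketch. This is not the AVAD implementation. The two features, the verbs, the feature values, and all function names below are hypothetical and were invented for illustration; the abstract does not specify which text features the system uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Error term of the log-loss gradient: predicted probability minus label.
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_proba(weights, bias, x):
    """Probability that a verb with feature vector x is an interaction verb."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def precision_recall(predicted, gold):
    """Precision and recall of a predicted verb set against a gold verb set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data, invented for illustration only (NOT from the paper).
# Hypothetical features per verb:
#   [fraction of occurrences with a gene/protein name as subject,
#    fraction of occurrences with a gene/protein name as object]
# Label 1 marks an interaction verb, 0 a non-interaction verb.
TOY_VERBS = {
    "bind": ([0.62, 0.55], 1),
    "phosphorylate": ([0.71, 0.60], 1),
    "inhibit": ([0.58, 0.49], 1),
    "activate": ([0.66, 0.52], 1),
    "show": ([0.08, 0.05], 0),
    "suggest": ([0.04, 0.02], 0),
    "describe": ([0.10, 0.06], 0),
    "use": ([0.12, 0.09], 0),
}

X = [feats for feats, _ in TOY_VERBS.values()]
y = [label for _, label in TOY_VERBS.values()]
weights, bias = train_logistic(X, y)

# Verbs whose predicted probability reaches 0.5 are classified as interaction verbs.
predicted = {v for v, (feats, _) in TOY_VERBS.items()
             if predict_proba(weights, bias, feats) >= 0.5}
gold = {v for v, (_, label) in TOY_VERBS.items() if label == 1}
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted), precision, recall)
```

A real evaluation would score verbs held out from training, and would compare them against an independent gold standard such as the manually constructed database mentioned in the abstract; scoring on the training verbs here only keeps the sketch short.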