Fukuda K, Tamura A, Tsunoda T, Takagi T
Human Genome Center, University of Tokyo, Japan.
Pac Symp Biocomput. 1998:707-18.
To unravel the mystery of life phenomena, we must clarify when genes are expressed and how their products interact with each other. However, because the continuously updated knowledge on these interactions is massive and is available only in the form of published articles, an intelligent information extraction (IE) system is needed. To extract this information directly from articles, the system must first identify material names. Medical and biological documents, however, often include proper nouns newly coined by the authors, and conventional methods based on domain-specific dictionaries cannot detect such unknown words or coinages. In this study, we propose PROPER, a new method of extracting material names that uses surface clues in character strings. It extracts material names from sentences with 94.70% precision and 98.84% recall, regardless of whether they are already known or newly defined.
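The abstract does not spell out which surface clues PROPER uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two hypothetical clues (a digit anywhere in a token, or an uppercase letter that is not merely sentence-initial capitalization) and shows how precision and recall of such an extractor could be computed against a hand-labeled gold set. The function names, the token pattern, and the example sentence are all illustrative assumptions.

```python
import re

# Assumed token pattern: alphanumeric runs that may contain hyphens or slashes.
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-/]*")


def has_surface_clue(token, sentence_initial):
    """Hypothetical clue test (not PROPER's actual rules)."""
    # Clue 1: any digit, as in "p53" or "SH3".
    if any(ch.isdigit() for ch in token):
        return True
    # Clue 2: an uppercase letter, ignoring the first letter of the sentence.
    body = token[1:] if sentence_initial else token
    return any(ch.isupper() for ch in body)


def extract_candidates(sentence):
    """Return tokens that carry at least one of the assumed surface clues."""
    candidates = []
    for index, match in enumerate(TOKEN.finditer(sentence)):
        token = match.group()
        if has_surface_clue(token, sentence_initial=(index == 0)):
            candidates.append(token)
    return candidates


def precision_recall(predicted, gold):
    """Precision = TP / predicted, recall = TP / gold, over sets of names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = "The SH3 domain of Grb2 binds to p53 and Sos1."
    found = extract_candidates(sentence)
    print(found)
    # Illustrative gold labels only; "Ras" is deliberately absent from the sentence.
    print(precision_recall(found, ["SH3", "Grb2", "p53", "Sos1", "Ras"]))
```

Because the clues depend only on the shape of the character string, a newly coined name is detected as readily as a known one, which is the property the abstract contrasts with dictionary lookup.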