Satou Kenji, Yamamoto Kaoru
School of Knowledge Science, Japan Advanced Institute of Science and Technology.
In Silico Biol. 2005;5(1):67-79.
Since biomedical texts contain a wide variety of domain-specific terms, building a large dictionary for term matching is of great relevance. However, because of null boundaries between adjacent terms, this matching is not a trivial problem. Moreover, it is known that generative words cannot be comprehensively included in a dictionary, because their possible variations are infinite. In this study, we report our approach to dictionary building and term matching in biomedical texts. A large number of terms, with or without part-of-speech (POS) and/or category information, were gathered, and a completion program generated approximately 1.36 million term variants to avoid stemming problems when matching terms. The dictionary was stored in a relational database management system (RDBMS) for quick lookup and used by a matching program. Since the matching operation is not restricted to substrings surrounded by space characters, the problem of null boundaries is avoided; this feature is also useful for generative words. Experimental results on the GENIA corpus are promising: nearly half of the possible terms were correctly recognized as meaningful segments, and most of the remaining half could be correctly recognized by post-processing such as chunking and further decomposition. It should be remarked that although we used no term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup were performed in most cases.
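The key idea of matching terms without requiring space-delimited boundaries can be illustrated with a minimal sketch. This is not the authors' implementation (which uses an RDBMS-backed dictionary of ~1.36 million variants); the dictionary entries, categories, and the greedy longest-match strategy below are illustrative assumptions only.

```python
# Hypothetical toy dictionary; the paper's dictionary holds ~1.36M term
# variants with optional POS/category information stored in an RDBMS.
TERM_DICT = {
    "IL-2": "protein",
    "IL-2 gene": "DNA",
    "gene expression": "process",
    "expression": "process",
}

def match_terms(text, dictionary):
    """Greedy longest-match segmentation against the dictionary.

    The scan is not restricted to substrings surrounded by spaces, so a
    term may span a null boundary (e.g. "IL-2 gene" across a space, or
    a term embedded directly against another token).
    """
    results = []
    i = 0
    while i < len(text):
        # Find the longest dictionary term starting at position i.
        best = None
        for term in dictionary:
            if text.startswith(term, i) and (best is None or len(term) > len(best)):
                best = term
        if best:
            results.append((best, dictionary[best]))
            i += len(best)
        else:
            i += 1  # no term starts here; advance one character
    return results

print(match_terms("IL-2 gene expression", TERM_DICT))
# Longest match prefers "IL-2 gene" over the shorter "IL-2".
```

A production version would replace the linear scan over the dictionary with an indexed prefix lookup (e.g. SQL `LIKE 'prefix%'` queries or a trie), which is what makes an RDBMS-backed dictionary of this size practical.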