Utilizing weakly controlled vocabulary for sentence segmentation in biomedical literature.

Author information

Satou Kenji, Yamamoto Kaoru

Affiliation

School of Knowledge Science, Japan Advanced Institute of Science and Technology.

Publication information

In Silico Biol. 2005;5(1):67-79.

Abstract

Since biomedical texts contain a wide variety of domain-specific terms, building a large dictionary for term matching is of great relevance. However, because null boundaries can occur between adjacent terms, this matching is not a trivial problem. Moreover, it is known that generative words cannot be comprehensively included in a dictionary because their possible variations are infinite. In this study, we report our approach to dictionary building and term matching in biomedical texts. A large number of terms, with or without part-of-speech (POS) and/or category information, were gathered, and a completion program generated approximately 1.36 million term variants to avoid stemming problems when matching terms. The dictionary was stored in a relational database management system (RDBMS) for quick lookup and used by a matching program. Since the matching operation is not restricted to substrings surrounded by space characters, we can avoid the problem of null boundaries. This feature is also useful for generative words. Experimental results on the GENIA corpus are promising: nearly half of the possible terms were correctly recognized as meaningful segments, and most of the remaining half could be correctly recognized with post-processing such as chunking and further decomposition. It should be remarked that although we have not used term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup were performed in most cases.
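
The idea of matching that is not restricted to space-delimited substrings can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the matching strategy, so greedy longest-match is an assumption here, an in-memory set stands in for the RDBMS-backed dictionary, and the names `segment` and `max_len` are hypothetical.

```python
def segment(text, dictionary, max_len=None):
    """Greedy longest-match segmentation at the character level.

    Assumption: `dictionary` is a set of lowercase terms (a stand-in for
    the RDBMS lookup described in the abstract). Candidate substrings may
    start and end anywhere, not only at space characters.
    Returns a list of (segment, found_in_dictionary) pairs.
    """
    if max_len is None:
        max_len = max((len(term) for term in dictionary), default=1)

    segments = []
    unknown_start = None  # start index of the current unmatched run
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j].lower() in dictionary:
                match = text[i:j]
                break
        if match is None:
            # No term starts here: extend the unmatched run by one character.
            if unknown_start is None:
                unknown_start = i
            i += 1
        else:
            # Flush any pending unmatched run before recording the match.
            if unknown_start is not None:
                segments.append((text[unknown_start:i], False))
                unknown_start = None
            segments.append((match, True))
            i += len(match)
    if unknown_start is not None:
        segments.append((text[unknown_start:], False))
    return segments


# Toy dictionary (hypothetical entries, for illustration only).
terms = {"nf", "kappa", "b", "activation"}
result = segment("NFkappaB activation", terms)
# result == [("NF", True), ("kappa", True), ("B", True),
#            (" ", False), ("activation", True)]
```

In this toy case "NFkappaB" contains no spaces, yet it is decomposed into three dictionary terms, which is the null-boundary situation the abstract describes. A realistic system would also need the term variants and the post-processing mentioned above, which this sketch omits.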
