Satou Kenji, Yamamoto Kaoru
School of Knowledge Science, Japan Advanced Institute of Science and Technology.
In Silico Biol. 2005;5(1):67-79.
Since biomedical texts contain a wide variety of domain-specific terms, building a large dictionary for term matching is of great relevance. However, because of null boundaries between adjacent terms, this matching is not a trivial problem. Moreover, it is known that generative words cannot be comprehensively included in a dictionary, because their possible variations are infinite. In this study, we report our approach to dictionary building and term matching in biomedical texts. A large number of terms, with or without part-of-speech (POS) and/or category information, were gathered, and a completion program generated approximately 1.36 million term variants to avoid stemming problems when matching terms. The dictionary was stored in a relational database management system (RDBMS) for quick lookup and used by a matching program. Since the matching operation is not restricted to substrings surrounded by space characters, the problem of null boundaries is avoided; this feature is also useful for generative words. Experimental results on the GENIA corpus are promising: nearly half of the possible terms were correctly recognized as meaningful segments, and most of the remaining half could be correctly recognized by post-processing such as chunking and further decomposition. It should be remarked that although we used no term cost, connectivity cost, or syntactic information, reasonable segmentation and dictionary lookup were performed in most cases.
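The key idea of matching terms without requiring space-delimited boundaries can be illustrated with a minimal sketch. This is not the authors' implementation (which uses an RDBMS-backed dictionary of ~1.36 million variants); the dictionary entries, categories, and the greedy longest-match strategy below are illustrative assumptions only.

```python
# Hypothetical toy dictionary; the paper's dictionary holds ~1.36M term
# variants with optional POS/category information stored in an RDBMS.
TERM_DICT = {
    "IL-2": "protein",
    "IL-2 gene": "DNA",
    "gene expression": "process",
    "expression": "process",
}

def match_terms(text, dictionary):
    """Greedy longest-match segmentation against the dictionary.

    The scan is not restricted to substrings surrounded by spaces, so a
    term may span a null boundary (e.g. "IL-2 gene" across a space, or
    a term embedded directly against another token).
    """
    results = []
    i = 0
    while i < len(text):
        # Find the longest dictionary term starting at position i.
        best = None
        for term in dictionary:
            if text.startswith(term, i) and (best is None or len(term) > len(best)):
                best = term
        if best:
            results.append((best, dictionary[best]))
            i += len(best)
        else:
            i += 1  # no term starts here; advance one character
    return results

print(match_terms("IL-2 gene expression", TERM_DICT))
# Longest match prefers "IL-2 gene" over the shorter "IL-2".
```

A production version would replace the linear scan over the dictionary with an indexed prefix lookup (e.g. SQL `LIKE 'prefix%'` queries or a trie), which is what makes an RDBMS-backed dictionary of this size practical.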