
Identifying well-formed biomedical phrases in MEDLINE® text.

Affiliation

National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

J Biomed Inform. 2012 Dec;45(6):1035-41. doi: 10.1016/j.jbi.2012.05.005. Epub 2012 Jun 8.

Abstract

In the modern world, people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable, well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well-formed, and high-quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and a hybrid approach combining these two. In this paper we propose a supervised learning approach for identifying high-quality phrases. First, we obtain a set of known well-formed, useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases. Our goal is to identify these additional high-quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies, in the large collection, strings that are likely to be high-quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are humanly judged to be of high quality.
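The candidate-extraction step mentioned in the abstract (multiword strings that contain no stop words or punctuation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tiny `STOP_WORDS` list, the `candidate_strings` and `_ngrams` names, the whitespace tokenization, and the `max_len` cap on phrase length.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; the list used in the paper is not given here.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "with",
              "for", "is", "are", "was", "were", "by", "on"}


def _ngrams(run, max_len):
    """Yield every contiguous string of 2..max_len tokens from a token run."""
    for n in range(2, max_len + 1):
        for i in range(len(run) - n + 1):
            yield " ".join(run[i:i + n])


def candidate_strings(text, max_len=3):
    """Count multiword strings that contain no stop words and cross no punctuation.

    Assumption: tokens are whitespace-separated and hyphens are kept inside tokens.
    """
    counts = Counter()
    # Splitting on punctuation guarantees no candidate spans a punctuation mark.
    for segment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        for token in segment.split():
            if token in STOP_WORDS:
                # A stop word ends the current run of content words.
                counts.update(_ngrams(run, max_len))
                run = []
            else:
                run.append(token)
        counts.update(_ngrams(run, max_len))  # flush the final run of the segment
    return counts
```

Applied to a sentence such as "Expression of the epidermal growth factor receptor in breast cancer, and lung cancer.", this yields candidates like "epidermal growth factor" and "breast cancer" while skipping strings such as "receptor in". In the setting the abstract describes, strings of this kind would form the large unlabeled set that a classifier trained on known positive phrases then ranks.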


Similar articles

1
Identifying well-formed biomedical phrases in MEDLINE® text.
J Biomed Inform. 2012 Dec;45(6):1035-41. doi: 10.1016/j.jbi.2012.05.005. Epub 2012 Jun 8.
2
Corpus-based statistical screening for phrase identification.
J Am Med Inform Assoc. 2000 Sep-Oct;7(5):499-511. doi: 10.1136/jamia.2000.0070499.
4
Identifying synonymy between relational phrases using word embeddings.
J Biomed Inform. 2015 Aug;56:94-102. doi: 10.1016/j.jbi.2015.05.010. Epub 2015 May 22.
5
Semantic tagging for medical knowledge tracking.
Conf Proc IEEE Eng Med Biol Soc. 2006;2006:6257-60. doi: 10.1109/IEMBS.2006.260154.
7
Information content in Medline record fields.
Int J Med Inform. 2004 Jun 30;73(6):515-27. doi: 10.1016/j.ijmedinf.2004.02.008.
