Identifying well-formed biomedical phrases in MEDLINE® text.

Affiliation

National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

J Biomed Inform. 2012 Dec;45(6):1035-41. doi: 10.1016/j.jbi.2012.05.005. Epub 2012 Jun 8.

DOI:10.1016/j.jbi.2012.05.005

PMID:22683889

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3465642/

Abstract

In the modern world, people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable, well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well-formed, and high-quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and a hybrid approach combining these two. In this paper we propose a supervised learning approach for identifying high-quality phrases. First, we obtain a set of known well-formed, useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases. Our goal is to identify these additional high-quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies strings in the large collection that are likely to be high-quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are humanly judged to be of high quality.
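
The candidate-extraction and labeling steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the tokenizer (which treats hyphenated words as single tokens), the `max_len` cap, and the function names `candidate_strings` and `split_by_label` are all assumptions made for illustration. The sketch breaks text at every stop word or punctuation mark, emits the multiword n-grams inside each unbroken run, and then separates candidates found in a known phrase set (positives) from the rest (unlabeled).

```python
import re

# Hypothetical, deliberately tiny stop-word list; the list used in the paper is not given here.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "from",
              "in", "is", "of", "on", "or", "the", "to", "was", "were", "with"}

# Assumption: a word is a run of ASCII letters/digits, optionally joined by hyphens.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def candidate_strings(text, max_len=3):
    """Return multiword strings (2..max_len tokens) with no stop word or punctuation."""
    tokens = TOKEN.findall(text.lower())
    candidates = []
    run = []  # current unbroken run of content words
    for tok in tokens + ["."]:  # trailing sentinel flushes the final run
        if WORD.fullmatch(tok) and tok not in STOP_WORDS:
            run.append(tok)
            continue
        # A stop word or punctuation mark ends the run: emit its n-grams.
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                candidates.append(" ".join(run[i:i + n]))
        run = []
    return candidates


def split_by_label(candidates, known_phrases):
    """Separate candidates into known positives and the unlabeled remainder."""
    positives = [c for c in candidates if c in known_phrases]
    unlabeled = [c for c in candidates if c not in known_phrases]
    return positives, unlabeled


if __name__ == "__main__":
    sentence = ("Treatment of breast cancer cells with tamoxifen, "
                "a selective estrogen receptor modulator.")
    cands = candidate_strings(sentence)
    pos, unl = split_by_label(cands, {"breast cancer", "estrogen receptor"})
    print(pos)
    print(unl)
```

In this setup the unlabeled set is not treated as negative: per the abstract, it is expected to contain many well-formed phrases, and a classifier trained on features of the positives would be used to rank it. The choice of features and learner is left open here because the abstract does not specify them.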

Similar articles

Corpus-based statistical screening for phrase identification.

J Am Med Inform Assoc. 2000 Sep-Oct;7(5):499-511. doi: 10.1136/jamia.2000.0070499.

Effective grading of termhood in biomedical literature.

AMIA Annu Symp Proc. 2005;2005:809-13.

Identifying synonymy between relational phrases using word embeddings.

J Biomed Inform. 2015 Aug;56:94-102. doi: 10.1016/j.jbi.2015.05.010. Epub 2015 May 22.

Semantic tagging for medical knowledge tracking.

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:6257-60. doi: 10.1109/IEMBS.2006.260154.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Information content in Medline record fields.

Int J Med Inform. 2004 Jun 30;73(6):515-27. doi: 10.1016/j.ijmedinf.2004.02.008.

Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives.

J Biomed Inform. 2014 Apr;48:54-65. doi: 10.1016/j.jbi.2013.11.008. Epub 2013 Dec 4.

On search guide phrase compilation for recommending home medical products.

Annu Int Conf IEEE Eng Med Biol Soc. 2010;2010:2167-71. doi: 10.1109/IEMBS.2010.5626435.

BIOSMILE: a semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features.

BMC Bioinformatics. 2007 Sep 1;8:325. doi: 10.1186/1471-2105-8-325.

Cited by

PubMed Phrases, an open set of coherent phrases for searching biomedical literature.

Sci Data. 2018 Jun 12;5:180104. doi: 10.1038/sdata.2018.104.

MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank.

J Biomed Semantics. 2017 Apr 17;8(1):15. doi: 10.1186/s13326-017-0123-3.

Retro: concept-based clustering of biomedical topical sets.

Bioinformatics. 2014 Nov 15;30(22):3240-8. doi: 10.1093/bioinformatics/btu514. Epub 2014 Jul 29.
