National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
J Biomed Inform. 2012 Dec;45(6):1035-41. doi: 10.1016/j.jbi.2012.05.005. Epub 2012 Jun 8.
In the modern world, people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable, well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well-formed, and high-quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and hybrid approaches that combine the two. Here we propose a supervised learning approach for identifying high-quality phrases. First, we obtain a set of known well-formed, useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases, and our goal is to identify these additional high-quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. With a proper choice of machine learning methods and features, we identify strings in the large collection that are likely to be high-quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are judged by humans to be of high quality.
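The candidate extraction step described above (multiword strings containing no stop words or punctuation) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the tokenizer, the phrase length limits, and the function name `candidate_strings` are all assumptions made for illustration only.

```python
import re

# Tiny illustrative stop-word list (an assumption; the list used in the
# paper is not given in the abstract).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
              "of", "on", "or", "the", "to", "with"}

# A token is either a word (internal hyphens allowed) or one punctuation mark.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*|[^\sa-z0-9]")
WORD_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def candidate_strings(text, min_len=2, max_len=3):
    """Return multiword strings that contain no stop word or punctuation.

    min_len and max_len bound the number of words per candidate
    (hypothetical defaults, not taken from the paper).
    """
    tokens = TOKEN_RE.findall(text.lower())
    candidates = []
    run = []  # current run of consecutive content words
    for tok in tokens + ["."]:  # sentinel flushes the final run
        if WORD_RE.fullmatch(tok) and tok not in STOP_WORDS:
            run.append(tok)
            continue
        # A stop word or punctuation mark ends the run; emit its n-grams.
        for n in range(min_len, max_len + 1):
            for i in range(len(run) - n + 1):
                candidates.append(" ".join(run[i:i + n]))
        run = []
    return candidates


print(candidate_strings("Mutations in the breast cancer gene, BRCA1."))
```

In the supervised setting the abstract describes, strings produced this way would form the unlabeled pool, while phrases drawn from an existing curated source would supply the positive labels.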