Department of Biomedical Informatics, Arizona State University, Phoenix, AZ, USA.
J Biomed Inform. 2012 Feb;45(1):129-40. doi: 10.1016/j.jbi.2011.10.007. Epub 2011 Nov 7.
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns they cover, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text. The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract mentions of medical problems, treatments, and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type "clinical trials" to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to traditional features such as dictionary matching, pattern matching, and part-of-speech tags, we also used, as features, words that appear in contexts similar to that of the word in question (that is, words whose vector representations, derived using methods of distributional semantics, are similar as measured by the commonly used cosine metric). To the best of our knowledge, this is the first effort to explore the use of distributional semantics, the semantics derived empirically from unannotated text, often using vector space models, for a sequence classification task such as concept extraction. Therefore, we first experimented with different sliding window models and identified the parameter settings that led to the best performance in a preliminary sequence labeling task. The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only baseline, the micro-averaged F-score for exact match increased from 80.3% to 82.3%, and the micro-averaged F-score for inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and are also notable relative to the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
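The distributional-semantics feature described in the abstract can be illustrated with a minimal sketch: build sliding-window co-occurrence vectors from an unannotated corpus, rank other words by cosine similarity, and attach each token's nearest neighbors as additional features for the sequence classifier. The toy corpus, window size, top-k value, and helper names below are illustrative assumptions, not the parameters or code used in the paper.

# Minimal sketch (not the authors' code) of the distributional-semantics
# feature idea: sliding-window co-occurrence vectors from unannotated text,
# cosine similarity to find words that occur in similar contexts, and a
# per-token feature dictionary of the kind a CRF tagger could consume.
from collections import Counter, defaultdict
import math

def build_vectors(sentences, window=2):
    """Map each word to a sparse co-occurrence vector over a +/- window."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counters)."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similar_words(word, vectors, k=3):
    """Top-k distributionally similar words for a given token."""
    if word not in vectors:
        return []
    sims = [(other, cosine(vectors[word], vectors[other]))
            for other in vectors if other != word]
    return [w for w, _ in sorted(sims, key=lambda p: p[1], reverse=True)[:k]]

# Toy unannotated "corpus"; in the paper this role is played by Medline
# clinical-trial abstracts.
corpus = [
    "the patient was given aspirin for chest pain".split(),
    "the patient was given ibuprofen for knee pain".split(),
    "an mri was ordered for the patient".split(),
]
vecs = build_vectors(corpus)
print(similar_words("aspirin", vecs))  # e.g. ['ibuprofen', ...]

# Hypothetical token-level feature dictionary: the "sim_*" entries from the
# distributional model are added alongside conventional features.
features = {"word": "aspirin", "pos": "NN"}
for idx, sim in enumerate(similar_words("aspirin", vecs)):
    features[f"sim_{idx}"] = sim
print(features)

In the setup described above, features of this kind would be combined with the dictionary, pattern, and part-of-speech features for each token before training the Conditional Random Fields model; the sliding-window parameters would be chosen by the preliminary sequence labeling experiments mentioned in the abstract.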