Children's Hospital Boston and Harvard Medical School, Boston, Massachusetts 02114, USA.
J Am Med Inform Assoc. 2012 Jul-Aug;19(4):660-7. doi: 10.1136/amiajnl-2011-000599. Epub 2012 Jan 31.
To research computational methods for coreference resolution in the clinical narrative and build a system implementing the best methods.
The Ontology Development and Information Extraction corpus, annotated for coreference relations, consists of 7214 coreferential markables forming 5992 pairs and 1304 chains. We trained classifiers on semantic, syntactic, and surface features, pruned by feature selection. For each of the three system components (resolution of relative pronouns, personal pronouns, and noun phrases), we experimented with support vector machines with linear and radial basis function (RBF) kernels, decision trees, and perceptrons. The algorithms and the varied feature sets were evaluated using standard metrics.
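The relationship between pairs and chains can be illustrated with a minimal sketch. It assumes that a chain is the transitive closure of pairwise coreference links and uses a union-find structure to group markables; the function name `build_chains` and the markable ids are hypothetical and are not taken from the corpus or the released system.

```python
from collections import defaultdict


def build_chains(pairs):
    """Group markables into chains via the transitive closure of pairwise links."""
    parent = {}

    def find(x):
        # Register unseen markables as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two chains

    chains = defaultdict(set)
    for markable in list(parent):
        chains[find(markable)].add(markable)
    # Sort for a deterministic result: members within a chain, then chains.
    return sorted((sorted(c) for c in chains.values()), key=lambda c: c[0])


# Hypothetical markable ids and coreference links (illustration only).
pairs = [("m1", "m2"), ("m2", "m5"), ("m3", "m4")]
chains = build_chains(pairs)
print(chains)
```

In this toy input, three links over five markables yield two chains. A pairwise classifier only decides individual links, so some grouping step of this kind is needed before chain-level metrics can be computed.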
The best-performing combination is support vector machines with an RBF kernel and all features (MUC=0.352, B³=0.690, CEAF=0.486, BLANC=0.596), outperforming a traditional decision tree baseline.
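For readers unfamiliar with these metrics, the following sketch shows how two of them, B³ and MUC, are commonly computed from gold (key) and system (response) chains. It assumes that key and response cover the same set of mentions, which sidesteps the handling of spurious or missing mentions; it is not the evaluation code used in the study, and all names and the toy data are hypothetical.

```python
def b_cubed(key, response):
    """B-cubed precision, recall, F1 for chains given as lists of sets."""
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    mentions = key_of.keys() & resp_of.keys()
    # Per-mention overlap between its gold chain and its system chain.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def muc_recall(key, response):
    """MUC recall: links recovered divided by links needed, per key chain."""
    resp_index = {m: i for i, chain in enumerate(response) for m in chain}
    found = needed = 0
    for chain in key:
        touched = {resp_index[m] for m in chain if m in resp_index}
        unresolved = sum(1 for m in chain if m not in resp_index)
        partitions = len(touched) + unresolved
        found += len(chain) - partitions
        needed += len(chain) - 1
    return found / needed if needed else 0.0


def muc(key, response):
    """MUC precision, recall, F1; precision swaps the roles of key and response."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and system chains over mentions 1..5.
key = [{1, 2, 3}, {4, 5}]
response = [{1, 2}, {3, 4, 5}]
b3_scores = b_cubed(key, response)
muc_scores = muc(key, response)
print(b3_scores, muc_scores)
```

MUC counts links, so it gives no credit for singleton mentions, whereas B³ averages over mentions. That difference is one reason the two scores for the same system output can diverge.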
The application performed well, at a level similar to that reported for general English text. The main source of error was markables separated by more than the 10-sentence window. A possible solution is suggested by the observation that coreferent markables sometimes occurred in predictable, although distant, note sections. Another limitation of the system is its failure to fully utilize synonymy and ontological knowledge. Future work will investigate additional ways to incorporate syntactic features into the coreference problem.
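The effect of the sentence window can be sketched as follows. The example assumes that candidate pairs are formed only between markables at most 10 sentences apart, so any true pair farther apart is never presented to the classifier; the exact boundary handling, the function name `candidate_pairs`, and the sample data are assumptions made for illustration.

```python
from itertools import combinations


def candidate_pairs(markables, max_sentence_distance=10):
    """Return (antecedent, anaphor) id pairs whose sentences lie within the window.

    markables: list of (markable_id, sentence_index) tuples.
    """
    ordered = sorted(markables, key=lambda m: m[1])  # document order
    return [
        (earlier[0], later[0])
        for earlier, later in combinations(ordered, 2)
        if later[1] - earlier[1] <= max_sentence_distance
    ]


# Hypothetical markables with sentence indices (illustration only).
markables = [("m1", 0), ("m2", 3), ("m3", 14)]
window_pairs = candidate_pairs(markables)
print(window_pairs)
```

Here `m3` is 14 and 11 sentences away from the other two markables, so it is never paired, regardless of how good the classifier is. Widening the window recovers such pairs at the cost of many more candidates to classify.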
We investigated computational methods for coreference resolution in the clinical narrative. The best methods are released as modules of the open source Clinical Text Analysis and Knowledge Extraction System and Ontology Development and Information Extraction platforms.