Detection of sentence boundaries and abbreviations in clinical narratives.

Author information

Kreuzthaler Markus, Schulz Stefan

Publication information

BMC Med Inform Decis Mak. 2015;15 Suppl 2(Suppl 2):S4. doi: 10.1186/1472-6947-15-S2-S4. Epub 2015 Jun 15.

Abstract

BACKGROUND

In Western languages the period character is highly ambiguous because it serves both as a sentence delimiter and as an abbreviation marker. This is particularly relevant in clinical free text, which is characterized by numerous anomalies in spelling, punctuation and vocabulary and by a high frequency of short forms.

METHODS

The problem is addressed by two binary classifiers, one for abbreviation detection and one for sentence detection. For each classification task, a support vector machine with a linear kernel is trained on different combinations of feature sets. Feature relevance ranking is applied to investigate which features are important for each task. The methods are applied to German-language texts from a medical record system, authored by specialized physicians.
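As a rough illustration of the setup described above, the following Python sketch builds a small feature vector for each period-terminated token and trains two independent linear classifiers on the hinge loss, one for abbreviation detection and one for sentence-end detection. Everything specific here is an assumption made for illustration: the feature names, the toy German tokens and their labels, the sub-gradient training loop (used in place of a real SVM library), and the weight-magnitude ranking, which only stands in for whatever feature relevance ranking the authors applied.

```python
import random


def period_features(tokens, i):
    """Hypothetical feature map for the period-terminated token tokens[i]."""
    stem = tokens[i].rstrip(".")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "bias": 1.0,
        "short_stem": float(len(stem) <= 4),
        "stem_is_number": float(stem.isdigit()),
        "inner_period": float("." in stem),
        "next_capitalized": float(nxt[:1].isupper()),
        "is_last_token": float(nxt == ""),
    }


def score(w, x):
    """Linear decision value w . x for sparse dict vectors."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def train_linear_svm(samples, labels, epochs=100, lam=0.01, seed=0):
    """Sub-gradient descent on the L2-regularised hinge loss (Pegasos-style)."""
    rng = random.Random(seed)
    w = {}
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for j in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[j], labels[j]
            violated = y * score(w, x) < 1.0  # margin check before the update
            for k in w:
                w[k] *= 1.0 - eta * lam  # shrink step from the L2 penalty
            if violated:
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


def rank_features(w):
    """Order feature names by absolute weight, largest first."""
    return sorted(w, key=lambda k: abs(w[k]), reverse=True)


# Invented toy snippet and labels; not taken from the study's data.
tokens = "Pat. wurde am 3. Tag entlassen. Kontrolle in 2 Wo. empfohlen.".split()
period_idx = [i for i, tok in enumerate(tokens) if tok.endswith(".")]
X = [period_features(tokens, i) for i in period_idx]
is_abbreviation = [+1, -1, -1, +1, -1]  # "Pat." and "Wo." marked as short forms
ends_sentence = [-1, -1, +1, -1, +1]    # "entlassen." and "empfohlen."

w_abbr = train_linear_svm(X, is_abbreviation)
w_sent = train_linear_svm(X, ends_sentence)
print("abbreviation features by |weight|:", rank_features(w_abbr))
print("sentence-end features by |weight|:", rank_features(w_sent))
```

Ranking by absolute weight is only meaningful when the features share a comparable scale, as the binary indicators in this sketch do. Note also that capitalization of the following token is a weak cue in German, where all nouns are capitalized.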

RESULTS

Two collections of 3,024 text snippets were annotated for training and testing with respect to the role of period characters. Inter-annotator agreement, measured by Cohen's kappa, was 0.98. On the training set, 10-fold cross-validation yielded an unweighted micro-averaged F-measure of 0.97 for abbreviation and sentence boundary detection. On the test set, we obtained an unweighted micro-averaged F-measure of 0.95 for abbreviation detection and 0.94 for sentence delineation. Language-dependent resources and rules were found to have less impact on abbreviation detection than on sentence delineation.
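For readers unfamiliar with the two statistics reported above, the following Python sketch shows how Cohen's kappa and a micro-averaged F-measure are commonly computed. The label sequences are invented toy data, the function names are hypothetical, and pooling true positives, false positives and false negatives across classes is one common reading of "micro-averaged"; the abstract does not spell out the authors' exact computation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two annotators."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


def micro_f1(gold, pred, classes):
    """Pool TP/FP/FN over all classes, then compute precision, recall and F1."""
    tp = fp = fn = 0
    for c in classes:
        for g, p in zip(gold, pred):
            tp += (g == c and p == c)
            fp += (g != c and p == c)
            fn += (g == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: "A" = abbreviation period, "S" = sentence-ending period.
annotator_1 = ["A", "A", "S", "S", "A", "S", "S", "A"]
annotator_2 = ["A", "A", "S", "S", "A", "S", "A", "A"]

print(cohens_kappa(annotator_1, annotator_2))         # 0.75
print(micro_f1(annotator_1, annotator_2, ["A", "S"]))  # 0.875
```

In this toy case the raters agree on 7 of 8 items (0.875) while chance agreement is 0.5, giving a kappa of 0.75. When every item receives exactly one label and all classes are pooled, the micro-averaged F-measure coincides with plain accuracy, which is why it also comes out at 0.875.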

CONCLUSIONS

Sentence detection is an important task that should be performed at the beginning of a text processing pipeline. For the text genre under scrutiny, we showed that support vector machines with a linear kernel produce state-of-the-art results for sentence boundary detection. The results are comparable with those of other sentence boundary detection methods applied to English clinical texts. We identified abbreviation detection as a supportive task for sentence delineation.
