Detection of sentence boundaries and abbreviations in clinical narratives.

Authors

Kreuzthaler Markus, Schulz Stefan

Publication information

BMC Med Inform Decis Mak. 2015;15 Suppl 2(Suppl 2):S4. doi: 10.1186/1472-6947-15-S2-S4. Epub 2015 Jun 15.

DOI:10.1186/1472-6947-15-S2-S4

PMID:26099994

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4474545/

Abstract

BACKGROUND

In Western languages the period character is highly ambiguous, due to its double role as sentence delimiter and abbreviation marker. This is particularly relevant in clinical free-texts characterized by numerous anomalies in spelling, punctuation, vocabulary and with a high frequency of short forms.

METHODS

The problem is addressed by two binary classifiers for abbreviation and sentence detection. A support vector machine exploiting a linear kernel is trained on different combinations of feature sets for each classification task. Feature relevance ranking is applied to investigate which features are important for the particular task. The methods are applied to German language texts from a medical record system, authored by specialized physicians.
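As a rough illustration of the approach described above, the sketch below trains one binary linear classifier that decides whether a period-terminated token is an abbreviation. It is a minimal sketch under stated assumptions, not the authors' implementation: the feature function `period_features`, the toy German examples, and the Pegasos-style stochastic subgradient solver for the linear SVM objective are all hypothetical stand-ins, since the abstract names neither the concrete features nor the SVM software. A second classifier for sentence boundaries could be trained the same way on boundary labels.

```python
import random

def period_features(token, next_token):
    """Hypothetical binary features for a token that ends with a period."""
    word = token.rstrip(".")
    return [
        1.0,                                       # bias term
        1.0 if len(word) <= 3 else 0.0,            # short token before the period
        1.0 if "." in word else 0.0,               # internal period, as in "z.B."
        1.0 if next_token[:1].isupper() else 0.0,  # next token is capitalized
        1.0 if next_token == "" else 0.0,          # period ends the snippet
    ]

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, lam=0.1, epochs=300, seed=0):
    """Pegasos-style stochastic subgradient descent on the regularized
    hinge loss (a linear-kernel SVM objective). Labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            margin = y[i] * dot(w, X[i])           # margin under current weights
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularization)
            if margin < 1.0:                       # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if dot(w, x) > 0 else -1

# Toy, hand-made examples (assumption): +1 = abbreviation, -1 = no abbreviation.
examples = [
    ("Pat.", "wurde", 1),
    ("z.B.", "Fieber", 1),
    ("re.", "Niere", 1),
    ("Dr.", "Huber", 1),
    ("unauffällig.", "Keine", -1),
    ("entlassen.", "", -1),
    ("stabil.", "Der", -1),
    ("Schmerzen.", "", -1),
]
X = [period_features(tok, nxt) for tok, nxt, _ in examples]
y = [label for _, _, label in examples]
weights = train_linear_svm(X, y)

# Classify an unseen period token.
print(predict(weights, period_features("li.", "Lunge")))
```

Because German capitalizes all nouns, the "next token is capitalized" feature is a much weaker sentence-start cue than it would be in English, so it is shown here only as one candidate feature among several.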

RESULTS

Two collections of 3,024 text snippets were annotated regarding the role of period characters for training and testing. Cohen's kappa resulted in 0.98. For abbreviation and sentence boundary detection we can report an unweighted micro-averaged F-measure using a 10-fold cross validation of 0.97 for the training set. For test set based evaluation we obtained an unweighted micro-averaged F-measure of 0.95 for abbreviation detection and 0.94 for sentence delineation. Language-dependent resources and rules were found to have less impact on abbreviation detection than on sentence delineation.
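The two kinds of figures reported above, Cohen's kappa and a micro-averaged F-measure, can be made concrete with a small worked example. The sketch below uses the standard textbook definitions and made-up labels (the `abbr`/`sent` label names and the ten toy items are assumptions, not data from the study); it does not claim to reproduce the authors' exact evaluation setup.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

def micro_f1(gold, predicted, classes):
    """Pool true positives, false positives and false negatives over
    all classes, then compute a single F-measure from the pooled counts."""
    tp = fp = fn = 0
    for c in classes:
        for g, p in zip(gold, predicted):
            tp += (p == c and g == c)
            fp += (p == c and g != c)
            fn += (p != c and g == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (assumption): role of ten period characters.
rater_a = ["abbr", "abbr", "sent", "sent", "abbr",
           "sent", "sent", "abbr", "sent", "sent"]
rater_b = ["abbr", "abbr", "sent", "sent", "sent",
           "sent", "sent", "abbr", "sent", "sent"]

kappa = cohen_kappa(rater_a, rater_b)
f1 = micro_f1(rater_a, rater_b, ["abbr", "sent"])
print(round(kappa, 3), round(f1, 3))
```

In this toy case the raters agree on 9 of 10 items, chance agreement is (4·3 + 6·7)/100 = 0.54, and kappa is therefore (0.90 − 0.54)/(1 − 0.54) ≈ 0.78. When every item carries exactly one label and all classes are pooled, the micro-averaged F-measure coincides with plain accuracy, here 0.90.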

CONCLUSIONS

Sentence detection is an important task, which should be performed at the beginning of a text processing pipeline. For the text genre under scrutiny we showed that support vector machines exploiting a linear kernel produce state of the art results for sentence boundary detection. The results are comparable with other sentence boundary detection methods applied to English clinical texts. We identified abbreviation detection as a supportive task for sentence delineation.
