Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models.

Authors

Zhang Shaodian, Kang Tian, Zhang Xingting, Wen Dong, Elhadad Noémie, Lei Jianbo

Affiliations

Department of Biomedical Informatics, Columbia University, New York, USA.

Center for Medical Informatics, Peking University, Beijing, China.

Publication information

J Biomed Inform. 2016 Apr;60:334-41. doi: 10.1016/j.jbi.2016.02.011. Epub 2016 Feb 26.

Abstract

Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentations are worth exploiting for this task. We propose a sequence-labeling-based system for speculation detection, which relies on features from bag of characters, bag of words, character embedding, and word embedding. We experiment on a novel dataset of 36,828 clinical notes, of which 2,000 notes carry 5,103 gold-standard speculation annotations, and compare systems in which word embeddings are calculated from word segmentations given by a general segmenter and by a domain-specific segmenter, respectively. Our systems reach an F score as high as 92.2%. We demonstrate that word segmentation is critical to producing high-quality word embeddings that facilitate downstream information extraction applications, and suggest that a domain-dependent word segmenter can be vital to such a clinical NLP task in Chinese.
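As an illustration of the sequence labeling setup described in the abstract, the following minimal Python sketch shows how character-level BIO tags and simple character and word context features could be derived for one sentence. It is not the authors' implementation: the tag name `SPEC`, the helper names, the window size, the example sentence, and its segmentation are all assumptions made for this example, and the embedding features are omitted.

```python
from typing import Dict, List, Tuple


def bio_labels(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert character-offset spans [start, end) into per-character BIO tags."""
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B-SPEC"  # hypothetical tag name for a speculation span
        for i in range(start + 1, end):
            labels[i] = "I-SPEC"
    return labels


def char_to_word(words: List[str]) -> List[int]:
    """Map each character position to the index of the word that contains it."""
    mapping: List[int] = []
    for w_idx, word in enumerate(words):
        mapping.extend([w_idx] * len(word))
    return mapping


def features(text: str, words: List[str], i: int, window: int = 1) -> Dict[str, str]:
    """Character-context and word features for the character at position i."""
    feats = {"char": text[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        # Positions outside the sentence are padded.
        feats[f"char[{off:+d}]"] = text[j] if 0 <= j < len(text) else "<PAD>"
    # The word feature depends entirely on the segmentation that is supplied.
    feats["word"] = words[char_to_word(words)[i]]
    return feats


# Hypothetical example: "pneumonia is considered possible".
text = "考虑肺炎可能"
words = ["考虑", "肺炎", "可能"]   # assumed output of some word segmenter
spans = [(0, 2), (4, 6)]           # assumed annotated speculation spans

labels = bio_labels(text, spans)
rows = [features(text, words, i) for i in range(len(text))]
for ch, tag, row in zip(text, labels, rows):
    print(ch, tag, row)
```

In a full system, feature dictionaries like these would be extended with character and word embedding vectors and passed to a sequence labeler. Because the `word` feature, and any word embedding looked up from it, changes with the segmentation, a different segmenter yields different inputs for the same sentence.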
