Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models.

Author information

Zhang Shaodian, Kang Tian, Zhang Xingting, Wen Dong, Elhadad Noémie, Lei Jianbo

Affiliations

Department of Biomedical Informatics, Columbia University, New York, USA.

Center for Medical Informatics, Peking University, Beijing, China.

Publication information

J Biomed Inform. 2016 Apr;60:334-41. doi: 10.1016/j.jbi.2016.02.011. Epub 2016 Feb 26.

DOI:10.1016/j.jbi.2016.02.011

PMID:26923634

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5282586/

Abstract

Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentation are worth exploiting for this task. We propose a sequence-labeling-based system for speculation detection, which relies on bag-of-characters, bag-of-words, character embedding, and word embedding features. We experiment on a novel dataset of 36,828 clinical notes, of which 2,000 notes carry 5,103 gold-standard speculation annotations, and compare systems whose word embeddings are computed from segmentations produced by a general-purpose segmenter and by a domain-specific segmenter. Our systems reach an F-score as high as 92.2%. We demonstrate that word segmentation is critical to producing high-quality word embeddings that facilitate downstream information extraction applications, and suggest that a domain-dependent word segmenter can be vital to such a clinical NLP task in Chinese.
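To make the feature design concrete, the following is a minimal Python sketch of how character-level sequence labeling could combine bag-of-characters and segmentation-derived bag-of-words features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the classifier, the window size, or the tag scheme, so the function names, the `B-SPEC`/`I-SPEC`/`O` tags, the window of 1, and the example sentence with its segmentation are all hypothetical. Embedding features are omitted; in a setup like this they would presumably be appended as real-valued entries looked up per character or per word.

```python
from typing import Dict, List, Tuple

def char_window_features(chars: List[str], i: int, window: int) -> Dict[str, float]:
    """Bag-of-characters features: characters within +/- window of position i."""
    feats: Dict[str, float] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(chars):
            feats[f"char[{offset}]={chars[j]}"] = 1.0
    return feats

def sentence_features(words: List[str], window: int = 1) -> List[Dict[str, float]]:
    """One feature dict per character, given a word segmentation of the sentence.

    `words` is the segmenter output; a different segmenter changes only the
    word-level features, which is the contrast the abstract studies.
    """
    chars = [c for w in words for c in w]
    # owner[i] is the index of the word that contains character i.
    owner = [w_i for w_i, w in enumerate(words) for _ in w]
    rows = []
    for i in range(len(chars)):
        feats = char_window_features(chars, i, window)
        w_i = owner[i]
        feats[f"word={words[w_i]}"] = 1.0
        if w_i > 0:
            feats[f"prev_word={words[w_i - 1]}"] = 1.0
        if w_i + 1 < len(words):
            feats[f"next_word={words[w_i + 1]}"] = 1.0
        rows.append(feats)
    return rows

def spans_to_bio(n_chars: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert speculation cue spans (start inclusive, end exclusive) to BIO tags."""
    tags = ["O"] * n_chars
    for start, end in spans:
        tags[start] = "B-SPEC"
        for k in range(start + 1, end):
            tags[k] = "I-SPEC"
    return tags

# Made-up example (not from the paper's dataset): "consider / pneumonia / possible".
words = ["考虑", "肺炎", "可能"]
rows = sentence_features(words, window=1)
tags = spans_to_bio(len(rows), [(0, 2), (4, 6)])  # hypothetical cue spans
for ch, tag in zip("".join(words), tags):
    print(ch, tag)
```

The per-character feature dicts and tag sequences produced this way are in the shape that typical sequence labelers accept as training input. Swapping the general segmenter for a domain-specific one would change `words`, and therefore the `word=`, `prev_word=`, and `next_word=` features, while leaving the character features untouched.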
