A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text.

Author affiliations

Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.

Department of Information Technology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China.

Publication information

BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):66. doi: 10.1186/s12911-019-0770-7.

DOI: 10.1186/s12911-019-0770-7

PMID: 30961602

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6454584/

Abstract

BACKGROUND

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing, and they are usually preliminary steps for many Chinese natural language processing (NLP) tasks. CWS and POS tagging have been studied extensively in various domains. However, few studies have addressed them in the clinical domain, because it is not easy to determine the granularity of words there.

METHODS

In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-granularity level, and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods, i.e., conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer. In order to validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus.
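Both CRF and BiLSTM-CRF are sequence labeling models, so a segmented and POS-tagged sentence is typically converted into one tag per character before training. The sketch below illustrates one common encoding, a B/M/E/S position marker joined with the POS label. The abstract does not say which tag scheme the authors used, so the scheme, the function names, and the example POS labels here are assumptions for illustration only.

```python
from typing import List, Tuple


def words_to_char_tags(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into per-character (char, tag) pairs.

    Each tag joins a position marker with the POS label, e.g. "B-n".
    B = begin, M = middle, E = end, S = single-character word.
    This tag scheme is an assumption, not taken from the paper.
    """
    tagged = []
    for word, pos in words:
        last = len(word) - 1
        for i, ch in enumerate(word):
            if last == 0:
                marker = "S"
            elif i == 0:
                marker = "B"
            elif i == last:
                marker = "E"
            else:
                marker = "M"
            tagged.append((ch, f"{marker}-{pos}"))
    return tagged


def char_tags_to_words(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rebuild (word, pos) pairs from per-character tags."""
    words = []
    buffer, buffer_pos = "", ""
    for ch, tag in tagged:
        marker, pos = tag.split("-", 1)
        buffer += ch
        buffer_pos = pos
        if marker in ("E", "S"):
            words.append((buffer, buffer_pos))
            buffer = ""
    if buffer:
        # A sequence that ends mid-word is still emitted as a word.
        words.append((buffer, buffer_pos))
    return words


# Hypothetical clinical phrase with made-up POS labels.
sentence = [("患者", "n"), ("无", "v"), ("发热", "v")]
tags = words_to_char_tags(sentence)
print(tags)
print(char_tags_to_words(tags) == sentence)
```

With this encoding, CWS alone can be evaluated by ignoring the POS part of each tag, while joint CWS and POS tagging requires both parts to be correct.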

RESULTS

When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also held an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For clinical NER, adding CWS information brought a maximum improvement of 0.34% in F-measure, while adding both CWS and POS information brought a maximum improvement of 0.74% in F-measure.
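Precision, recall and F-measure for segmentation are usually computed over word spans: a predicted word counts as correct only if both of its boundaries match a gold word. The following is a minimal sketch of that calculation, assuming exact span matching and the balanced F-measure F = 2PR / (P + R). The abstract does not describe the evaluation script, so the function names and the example sentence are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact span matching."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the prediction splits the last gold word in two.
gold = ["患者", "无", "发热"]
pred = ["患者", "无", "发", "热"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

For joint CWS and POS evaluation, the same idea applies with the POS label added to each span, so that a word is correct only when its boundaries and its tag both match.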

CONCLUSIONS

Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful: the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.
