A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text.

Author affiliations

Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.

Department of Information Technology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China.

Publication information

BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):66. doi: 10.1186/s12911-019-0770-7.

Abstract

BACKGROUND

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing and are usually preliminary steps for many Chinese natural language processing (NLP) tasks. CWS and POS tagging have been studied extensively in various domains; however, few studies have addressed them in the clinical domain, because it is not easy to determine the granularity of words.

METHODS

We investigated CWS and POS tagging for Chinese clinical text at a fine-grained level and manually annotated a corpus. On this corpus, we compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer (BiLSTM-CRF). To validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another, independent corpus.
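
Both CRF and BiLSTM-CRF are sequence labelers, so CWS and POS tagging are commonly cast as assigning one label to each character. The sketch below illustrates that framing only. The abstract does not state which label scheme was used; the BMES scheme, the function name `to_char_tags`, and the POS tags in the example (`NN`, `AD`, `VV`) are assumptions chosen for illustration, not the corpus's actual tag set.

```python
def to_char_tags(words, pos_tags=None):
    """Convert a segmented sentence into (character, label) pairs.

    Assumed BMES scheme: B = first, M = middle, E = last character of a
    multi-character word; S = single-character word. If pos_tags is given,
    each label is suffixed with the word's POS tag (e.g. "B-NN"), which
    turns joint CWS and POS tagging into one sequence-labeling task.
    """
    if pos_tags is not None and len(pos_tags) != len(words):
        raise ValueError("words and pos_tags must have the same length")

    pairs = []
    for i, word in enumerate(words):
        if not word:
            continue  # skip empty strings rather than emit labels for them
        if len(word) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        if pos_tags is not None:
            positions = [f"{p}-{pos_tags[i]}" for p in positions]
        pairs.extend(zip(word, positions))
    return pairs


# Hypothetical example sentence: "患者 / 无 / 发热" (patient / no / fever).
for char, label in to_char_tags(["患者", "无", "发热"], ["NN", "AD", "VV"]):
    print(char, label)
```

A labeler trained on such pairs predicts one label per character, and word boundaries are recovered by cutting after every E or S label.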

RESULTS

When only CWS was considered, CRF achieved higher precision, recall, and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF again outperformed BiLSTM-CRF. The margin was 0.14% in F-measure on CWS and 0.34% in F-measure on POS tagging. In the NER experiments, CWS information brought a maximum improvement of 0.34% in F-measure, while combined CWS and POS information brought a maximum improvement of 0.74% in F-measure.
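
For readers unfamiliar with how precision, recall, and F-measure are usually computed for segmentation, the following is a minimal sketch of the common word-level convention: a predicted word counts as correct only if its character span matches a gold word exactly (and, when POS tags are supplied, its tag matches as well). The abstract does not describe the authors' exact scoring procedure, so this convention and the names `word_spans` and `segmentation_prf` are assumptions.

```python
def word_spans(words, tags=None):
    """Map a segmented sentence to a set of (start, end[, tag]) character spans."""
    spans = set()
    start = 0
    for i, word in enumerate(words):
        end = start + len(word)
        spans.add((start, end) if tags is None else (start, end, tags[i]))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words, gold_tags=None, pred_tags=None):
    """Word-level precision, recall, and F-measure for one sentence.

    Pass both tag lists to score joint CWS and POS tagging; omit both to
    score segmentation alone.
    """
    gold = word_spans(gold_words, gold_tags)
    pred = word_spans(pred_words, pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: only "患者" is segmented correctly.
gold = ["患者", "无", "发热"]
pred = ["患者", "无发", "热"]
print(segmentation_prf(gold, pred))
```

In practice the counts of correct, predicted, and gold words are summed over the whole test set before the three scores are computed, rather than averaged per sentence.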

CONCLUSIONS

Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful: the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another, independent corpus.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c9e2/6454584/e83c74e0a0bc/12911_2019_770_Fig1_HTML.jpg
