Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
Department of Information Technology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China.
BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):66. doi: 10.1186/s12911-019-0770-7.
Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing and are usually preliminary steps for many Chinese natural language processing (NLP) tasks. Although CWS and POS tagging have been studied extensively in various domains, few studies have addressed them in the clinical domain, because it is not easy to determine the granularity of words.
In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-grained level and manually annotated a corpus. On this corpus, we compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer (BiLSTM-CRF). To validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus.
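Both CRF and BiLSTM-CRF are sequence labelers, so CWS is commonly cast as character-level tagging. The following is a minimal sketch of that casting, assuming a B/M/E/S (begin, middle, end, single) tag scheme; the abstract does not state which scheme was used, and the function names and the example sentence are hypothetical.

```python
from typing import List, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Map a segmented sentence to (character, tag) pairs.

    Assumed tag scheme (not confirmed by the abstract):
    B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            for ch in word[1:-1]:
                pairs.append((ch, "M"))
            pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild the word sequence from (character, tag) pairs."""
    words: List[str] = []
    current = ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical clinical sentence, already segmented by hand.
example = ["患者", "无", "发热"]
tagged = words_to_tags(example)
restored = tags_to_words(tagged)
```

A tagger trained on such pairs predicts one tag per character, and the predicted tags are decoded back into words with `tags_to_words`. Joint CWS and POS tagging can be handled the same way by attaching a POS label to each boundary tag.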
When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also outperformed BiLSTM-CRF. Overall, CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For NER, adding CWS information brought a maximum improvement of 0.34% in F-measure, while adding both CWS and POS information brought a maximum improvement of 0.74% in F-measure.
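For reference, precision, recall and F-measure for CWS are usually computed at the word level by matching character spans between the gold and predicted segmentations. The sketch below assumes that convention; the abstract does not describe the evaluation script, and the function names and example sentence are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence to a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the prediction over-splits the last word.
gold = ["患者", "无", "发热"]
pred = ["患者", "无", "发", "热"]
p, r, f = segmentation_prf(gold, pred)
```

Here two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and the F-measure is their harmonic mean. For POS tagging, the same matching can be applied to (start, end, tag) triples, so that a word counts as correct only when both its boundaries and its tag agree.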
Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful: the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.