Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

Author information

He Bin, Dong Bin, Guan Yi, Yang Jinfeng, Jiang Zhipeng, Yu Qiubin, Cheng Jianyi, Qu Chunyan

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

Ricoh Software Research Center (Beijing), Beijing, China.

Publication information

J Biomed Inform. 2017 May;69:203-217. doi: 10.1016/j.jbi.2017.04.006. Epub 2017 Apr 9.

DOI:10.1016/j.jbi.2017.04.006

PMID:28404537

Abstract

OBJECTIVE

To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, together with the corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.

MATERIALS AND METHODS

An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus.
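The abstract does not say which agreement measure was used for IAA. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, over two annotators' token-level POS tags. The function name `cohen_kappa` and the tag sequences are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: this is a generic agreement measure chosen for illustration;
    the abstract does not state the measure the authors used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical POS tags for six tokens (not real corpus data).
annotator_1 = ["NN", "VV", "NN", "AD", "NN", "VV"]
annotator_2 = ["NN", "VV", "NN", "NN", "NN", "VV"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

For entity and relation annotations, where the two annotators may mark different spans, agreement is often reported as a pairwise F1 score instead; whether the authors did so is not stated in the abstract.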

RESULTS

The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.

DISCUSSION

The annotated corpus makes a considerable contribution to natural language processing (NLP) research on Chinese texts in the clinical domain. However, the corpus has a number of limitations. Additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency.

CONCLUSIONS

In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
