He Bin, Dong Bin, Guan Yi, Yang Jinfeng, Jiang Zhipeng, Yu Qiubin, Cheng Jianyi, Qu Chunyan
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Ricoh Software Research Center (Beijing), Beijing, China.
J Biomed Inform. 2017 May;69:203-217. doi: 10.1016/j.jbi.2017.04.006. Epub 2017 Apr 9.
To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, together with the corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.
An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus.
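As an illustration of how IAA can be computed for a token-level task such as POS tagging, the following is a minimal sketch using Cohen's kappa. The abstract does not state which agreement measure was used, so the choice of kappa, the function name `cohen_kappa`, and the toy tag sequences are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sequence of tokens.

    Hypothetical helper: the source abstract does not specify its IAA metric.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy POS tags for six tokens (illustrative data, not from the corpus).
annotator_1 = ["NN", "VV", "NN", "AD", "VV", "NN"]
annotator_2 = ["NN", "VV", "NN", "VV", "VV", "NN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

For span-level tasks such as entity or relation annotation, agreement is often reported instead as an F-measure that treats one annotator as the reference; which of these the authors applied is not stated here.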
The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parse trees, while the semantic corpus includes 992 documents annotated with 39,511 entities (each with its assertion) and 7,693 relations. The IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
The annotated corpus makes a considerable contribution to natural language processing (NLP) research on Chinese texts in the clinical domain. However, the corpus has a number of limitations: additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency.
In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus was constructed together with its NLP modules, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.