Basic Medical School, Chengdu University of Traditional Chinese Medicine, No. 37, Shi Er Qiao Road, Chengdu, 610075, People's Republic of China.
College of Software Engineering, Chengdu University of Information Technology, No. 24, Xue Fu Road, Chengdu, 610225, People's Republic of China.
BMC Med Inform Decis Mak. 2020 Apr 6;20(1):64. doi: 10.1186/s12911-020-1079-2.
In this study, we focus on building a fine-grained entity annotation corpus, together with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the future construction of fine-grained corpora of TCM clinical records.
We developed a four-step approach suited to constructing a corpus from TCM medical records. First, we determined the entity types included in this study through sample annotation. Second, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to existing guidelines. Third, we iteratively updated the guideline until the inter-annotator agreement (IAA), measured as Cohen's kappa, exceeded 0.9. Finally, comprehensive annotation was performed while keeping the IAA above 0.9.
We annotated the 10,197 clinical records in five rounds. Four entity categories comprising 13 entity types were employed. The final fine-grained annotated entity corpus consists of 1,104 entities and 67,799 tokens. The final IAA is 0.936 on average (for three annotators), indicating that the fine-grained entity recognition corpus is of high quality.
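As an illustration of the agreement measure used above, the following is a minimal sketch of Cohen's kappa computed over token-level labels, with an average over annotator pairs. The abstract does not state how the three-annotator average was obtained, so averaging pairwise kappa values is an assumption here. The function names, the label names (SYMPTOM, HERB, O), and the sample data are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (an assumed aggregation)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical token-level labels from three annotators.
    annotations = {
        "annotator_1": ["SYMPTOM", "O", "HERB", "O", "SYMPTOM", "O"],
        "annotator_2": ["SYMPTOM", "O", "HERB", "O", "O", "O"],
        "annotator_3": ["SYMPTOM", "O", "HERB", "O", "SYMPTOM", "O"],
    }
    iaa = mean_pairwise_kappa(annotations)
    # The guideline would be revised again while the IAA is 0.9 or lower.
    print(f"mean pairwise kappa = {iaa:.3f}, above threshold: {iaa > 0.9}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in entity annotation because most tokens fall outside any entity and would inflate a simple percentage.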
These results will provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.