Fan Jung-wei, Prasad Rashmi, Yabut Rommel M, Loomis Richard M, Zisook Daniel S, Mattison John E, Huang Yang
Kaiser Permanente Southern California, Pasadena, CA, USA.
AMIA Annu Symp Proc. 2011;2011:382-91. Epub 2011 Oct 22.
Part-of-speech (POS) tagging is a fundamental step required by various NLP systems. Training a POS tagger relies on a sufficient amount of high-quality annotation. However, the annotation process is both knowledge-intensive and time-consuming in the clinical domain. A promising solution is for institutions to share their annotation efforts, yet there has been little research on the associated issues. We performed experiments to understand how POS tagging performance is affected by using a pre-trained tagger versus raw training data across different institutions. We manually annotated a set of clinical notes from Kaiser Permanente Southern California (KPSC) and a set from the University of Pittsburgh Medical Center (UPMC), and trained/tested POS taggers under intra- and inter-institution settings. The cTAKES POS tagger was also included in the comparison, representing a tagger partially trained on the notes of a third institution, Mayo Clinic at Rochester. Intra-institution 5-fold cross-validation yielded an estimated accuracy of 0.953 on the KPSC notes and 0.945 on the UPMC notes. Trained purely on KPSC notes, the tagger reached an accuracy of 0.897 when tested on UPMC notes; trained purely on UPMC notes, it reached 0.904 when tested on KPSC notes. Applying the cTAKES tagger pre-trained on Mayo Clinic's notes, the accuracy was 0.881 on KPSC notes and 0.883 on UPMC notes. After adding UPMC annotations to the KPSC training data, the average accuracy on the tested KPSC notes increased to 0.965; after adding KPSC annotations to the UPMC training data, the average accuracy on the tested UPMC notes increased to 0.953. The results indicated, first, that the performance of pre-trained POS taggers dropped by about 5% when applied directly across institutions and, second, that mixing in annotations from another institution that followed the same guideline increased tagging accuracy by about 1%. Our findings suggest that, for the POS tagging task, institutions benefit more from sharing raw annotations than from sharing pre-trained models. We believe the study can also provide general insights into cross-institution data sharing for other types of NLP tasks.
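To make the experimental design concrete, here is a minimal sketch, not the authors' implementation: it illustrates the three evaluation settings (intra-institution, cross-institution, and mixed training) using NLTK's trainable averaged-perceptron tagger as a stand-in for the taggers used in the study. The kpsc_sents and upmc_sents corpora are hypothetical placeholders for the annotated clinical notes, each sentence being a list of (token, POS-tag) pairs.

```python
# Sketch of the three training/testing settings compared in the abstract.
# Assumptions: NLTK installed; tiny placeholder corpora stand in for the
# KPSC and UPMC annotated notes (not real data from the study).
from nltk.tag.perceptron import PerceptronTagger

kpsc_sents = [[("Patient", "NN"), ("denies", "VBZ"), ("chest", "NN"),
               ("pain", "NN"), (".", ".")]]   # placeholder KPSC corpus
upmc_sents = [[("No", "DT"), ("acute", "JJ"), ("distress", "NN"),
               (".", ".")]]                   # placeholder UPMC corpus

def train_tagger(train_sents):
    tagger = PerceptronTagger(load=False)  # start from an untrained model
    tagger.train(train_sents, nr_iter=5)
    return tagger

def accuracy(tagger, gold_sents):
    # Token-level accuracy against gold annotations.
    correct = total = 0
    for sent in gold_sents:
        predicted = tagger.tag([tok for tok, _ in sent])
        for (_, gold), (_, pred) in zip(sent, predicted):
            correct += int(gold == pred)
            total += 1
    return correct / total

# Setting 1 (intra-institution): 5-fold cross-validation within one corpus,
# i.e. the loop above repeated over five train/test splits of kpsc_sents.
# Setting 2 (cross-institution): train on one corpus, test on the other.
cross = accuracy(train_tagger(kpsc_sents), upmc_sents)
# Setting 3 (mixed): add the other institution's annotations to the training set.
mixed = accuracy(train_tagger(kpsc_sents + upmc_sents), upmc_sents)
```

The design choice the abstract probes is exactly the difference between setting 2 (sharing a pre-trained model) and setting 3 (sharing the raw annotations and retraining), with setting 1 as the in-domain ceiling.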