Santus Enrico, Li Clara, Yala Adam, Peck Donald, Soomro Rufina, Faridi Naveen, Mamshad Isra, Tang Rong, Lanahan Conor R, Barzilay Regina, Hughes Kevin
Massachusetts Institute of Technology, Cambridge, MA.
Henry Ford Health System, Detroit, MI.
JCO Clin Cancer Inform. 2019 Jul;3:1-8. doi: 10.1200/CCI.18.00160.
Natural language processing (NLP) techniques have been adopted to reduce the curation costs of electronic health records. However, studies have questioned whether such techniques can be applied to data from previously unseen institutions. We investigated the performance of a common neural NLP algorithm on data from both known hospitals and heldout hospitals (ie, institutions whose data were withheld from the training set and used only for testing). We also explored how diversity in the training data affects the system's generalization ability.
We collected 24,881 breast pathology reports from seven hospitals and manually annotated them with nine key attributes that describe types of atypia and cancer. We trained a convolutional neural network (CNN) on annotations from either only one (CNN1), only two (CNN2), or only four (CNN4) hospitals. The trained systems were tested on data from five organizations, including both known and heldout ones. For every setting, we provide the accuracy scores as well as the learning curves that show how much data are necessary to achieve good performance and generalizability.
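The known versus heldout evaluation described above can be illustrated with a minimal sketch. It assumes test predictions are available as (hospital, gold label, predicted label) triples. The function name `accuracy_by_group`, the hospital identifiers, and the toy labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def accuracy_by_group(records, train_hospitals):
    """Compute test accuracy separately for known and heldout hospitals.

    records: iterable of (hospital, gold_label, predicted_label) triples
             for test reports (hypothetical format, assumed for this sketch).
    train_hospitals: set of hospitals whose reports were used for training.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for hospital, gold, pred in records:
        # A hospital is "known" if it contributed training data, else "heldout".
        group = "known" if hospital in train_hospitals else "heldout"
        total[group] += 1
        correct[group] += int(gold == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy test set (invented for illustration): hospital "A" is the only
# training source, so "B" and "C" count as heldout.
test_records = [
    ("A", "atypia", "atypia"),
    ("A", "benign", "benign"),
    ("B", "cancer", "cancer"),
    ("B", "benign", "atypia"),
    ("C", "cancer", "cancer"),
    ("C", "atypia", "benign"),
    ("C", "benign", "benign"),
    ("C", "benign", "benign"),
]

scores = accuracy_by_group(test_records, train_hospitals={"A"})
```

Growing `train_hospitals` from one to two or four entries would mirror the CNN1, CNN2, and CNN4 settings, with the heldout group shrinking accordingly.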
The system achieved a cross-institutional accuracy of 93.87% when trained on reports from only one hospital (CNN1). Performance improved to 95.7% and 96%, respectively, when the system was trained on reports from two (CNN2) and four (CNN4) hospitals. The introduction of diversity during training did not lead to improvements on the known institutions, but it boosted performance on the heldout institutions. When tested on reports from heldout hospitals, CNN4 outperformed CNN1 and CNN2 by 2.13% and 0.3%, respectively.
Real-world scenarios require that neural NLP approaches scale to data from previously unseen institutions. We show that a common neural NLP algorithm for information extraction can achieve this goal, especially when diverse data are used during training.