Gao Shang, Young Michael T, Qiu John X, Yoon Hong-Jun, Christian James B, Fearn Paul A, Tourassi Georgia D, Ramanathan Arvind
Computational Science and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
Surveillance Informatics Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD, USA.
J Am Med Inform Assoc. 2018 Mar 1;25(3):321-330. doi: 10.1093/jamia/ocx131.
We explored how a deep learning (DL) approach based on hierarchical attention networks (HANs) can improve model performance for multiple information extraction tasks from unstructured cancer pathology reports compared to conventional methods that do not sufficiently capture syntactic and semantic contexts from free-text documents.
Data for our analyses were obtained from 942 deidentified pathology reports collected by the National Cancer Institute Surveillance, Epidemiology, and End Results program. The HAN was implemented for 2 information extraction tasks: (1) primary site, matched to 12 International Classification of Diseases for Oncology topography codes (7 breast, 5 lung primary sites), and (2) histological grade classification, matched to G1-G4. Model performance metrics were compared to conventional machine learning (ML) approaches including naive Bayes, logistic regression, support vector machine, random forest, and extreme gradient boosting, and other DL models, including a recurrent neural network (RNN), a recurrent neural network with attention (RNN w/A), and a convolutional neural network.
Our results demonstrate that for both information extraction tasks, the HAN performed significantly better than the conventional ML and DL techniques. In particular, across the 2 tasks, the mean micro and macro F-scores for the HAN with pretraining were (0.852, 0.708), compared to naive Bayes (0.518, 0.213), logistic regression (0.682, 0.453), support vector machine (0.634, 0.434), random forest (0.698, 0.508), extreme gradient boosting (0.696, 0.522), RNN (0.505, 0.301), RNN w/A (0.637, 0.471), and convolutional neural network (0.714, 0.460).
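Every model above scores lower on the macro F-score than on the micro F-score, which is the pattern expected when some classes are rare. The following is a minimal sketch of how the two averages differ, assuming single-label multiclass predictions and the standard definitions of micro and macro averaging. The abstract does not state the exact evaluation procedure, so the function name `f_scores` and the toy grade labels are illustrative only and are not taken from the study.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): 8 reports of a common grade, 2 of a rare grade,
# and a classifier that always predicts the common grade.
y_true = ["G2"] * 8 + ["G4"] * 2
y_pred = ["G2"] * 10
micro, macro = f_scores(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the micro F-score stays high because most reports belong to the common class, while the macro F-score is pulled down by the rare class that is never predicted correctly. The gap between the two numbers is therefore a rough indicator of how well a model handles infrequent labels.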
HAN-based DL models show promise in information abstraction tasks within unstructured clinical pathology reports.