Yoon Hong-Jun, Klasky Hilda B, Gounley John P, Alawad Mohammed, Gao Shang, Durbin Eric B, Wu Xiao-Cheng, Stroup Antoinette, Doherty Jennifer, Coyle Linda, Penberthy Lynne, Blair Christian J, Tourassi Georgia D
Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
College of Medicine, University of Kentucky, Lexington, KY 40536, United States of America.
J Biomed Inform. 2020 Oct;110:103564. doi: 10.1016/j.jbi.2020.103564. Epub 2020 Sep 9.
In machine learning, it is well established that classification performance improves when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks requires tremendous computational resources and training time. The question we aimed to answer in this study is whether dividing a problem into sub-problems can yield higher task performance scores while also accelerating training.
The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned-data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split one large problem into 20 sub-problems, resampled the training cases 2,000 times, and trained a deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We trained many of these models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).
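The partitioned bagging procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the majority-label "model" is a trivial stand-in for the MT-CNN and MT-HCAN classifiers, the toy cases and all function names are hypothetical, and resampling within each partition is an assumption, since the abstract does not state the order of resampling and partitioning. The sketch only shows how the model count arises (partitions × bootstrap samples, e.g. 20 × 2,000 = 40,000) and how per-partition models are aggregated by majority vote.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(cases, rng):
    """Draw len(cases) cases with replacement (one bootstrap sample)."""
    return [rng.choice(cases) for _ in cases]


def train_majority_model(cases):
    """Stand-in for a deep learning model: remembers the most frequent label."""
    counts = Counter(label for _partition, _text, label in cases)
    return counts.most_common(1)[0][0]


def train_partitioned_bagging(cases, n_bootstrap, seed=0):
    """Train n_bootstrap models per partition (assumed: resampling within partitions)."""
    rng = random.Random(seed)
    by_partition = defaultdict(list)
    for case in cases:
        by_partition[case[0]].append(case)
    models = defaultdict(list)
    for partition, subset in by_partition.items():
        for _ in range(n_bootstrap):
            sample = bootstrap_sample(subset, rng)
            models[partition].append(train_majority_model(sample))
    return models


def predict(models, partition):
    """Aggregate the models of one partition by majority vote."""
    votes = Counter(models[partition])
    return votes.most_common(1)[0][0]


# Hypothetical toy cases: (partition, report text, histology label).
toy_cases = (
    [("breast", "report text", "ductal")] * 9
    + [("breast", "report text", "lobular")]
    + [("lung", "report text", "adenocarcinoma")] * 9
    + [("lung", "report text", "squamous")]
)

models = train_partitioned_bagging(toy_cases, n_bootstrap=25)
n_models = sum(len(ensemble) for ensemble in models.values())
print(n_models, predict(models, "breast"), predict(models, "lung"))
```

In this sketch every (partition, bootstrap sample) pair is trained independently of the others, which is what makes concurrent training of many models possible. Prediction assumes the partition of a new case is already known; an error in assigning the partition would send the case to the wrong ensemble.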
We demonstrated that aggregating the models improves task performance compared with the single-model approach, which is consistent with other research studies. We also demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results suggest that data partitioning may alleviate the complexity of the task. In contrast, the methods did not achieve superior scores for the site and subsite classification tasks. Because data partitioning was based on the primary cancer site, the accuracy depended on how the partitions were determined, which needs further investigation and improvement.
The results of this study demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was faster when leveraging the high-performance Summit supercomputer at ORNL.