Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports.

Authors

Yoon Hong-Jun, Klasky Hilda B, Gounley John P, Alawad Mohammed, Gao Shang, Durbin Eric B, Wu Xiao-Cheng, Stroup Antoinette, Doherty Jennifer, Coyle Linda, Penberthy Lynne, Blair Christian J, Tourassi Georgia D

Affiliations

Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.

College of Medicine, University of Kentucky, Lexington, KY 40536, United States of America.

Publication information

J Biomed Inform. 2020 Oct;110:103564. doi: 10.1016/j.jbi.2020.103564. Epub 2020 Sep 9.

Abstract

OBJECTIVE

In machine learning, it is well established that classification performance improves when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks requires tremendous computational resources and training time. The research question we aimed to answer is whether dividing a problem into sub-problems can both achieve higher task performance scores and accelerate training.

MATERIALS AND METHODS

The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a large problem into 20 sub-problems, resampled the training cases 2,000 times, and trained a deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We trained many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).
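
The partition-then-bag procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the trivial majority-label "model" stands in for the MT-CNN and MT-HCAN classifiers, all function names are hypothetical, and the partition key of each case is assumed to be known in advance, whereas the study has to determine the partition from the primary cancer site.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(cases, rng):
    """Draw len(cases) items with replacement (one bootstrap replicate)."""
    return [rng.choice(cases) for _ in cases]


def train_majority_model(cases):
    """Stand-in for a deep model: it always predicts the most common label."""
    counts = Counter(label for _, label in cases)
    return counts.most_common(1)[0][0]


def train_partitioned_bagging(cases, n_bootstraps, seed=0):
    """Train one model per (partition, bootstrap sample) pair.

    `cases` is a list of (partition_key, label) tuples. The partition key
    is assumed to be given; it is a hypothetical simplification.
    """
    rng = random.Random(seed)
    by_partition = defaultdict(list)
    for partition, label in cases:
        by_partition[partition].append((partition, label))
    return {
        partition: [
            train_majority_model(bootstrap_sample(subset, rng))
            for _ in range(n_bootstraps)
        ]
        for partition, subset in by_partition.items()
    }


def predict(models, partition):
    """Aggregate the bagged models of one partition by majority vote."""
    votes = Counter(models[partition])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    toy_cases = (
        [("breast", "ductal")] * 9
        + [("breast", "lobular")]
        + [("lung", "adeno")] * 5
    )
    toy_models = train_partitioned_bagging(toy_cases, n_bootstraps=25)
    total = sum(len(m) for m in toy_models.values())
    print(f"trained {total} models across {len(toy_models)} partitions")
    print("lung prediction:", predict(toy_models, "lung"))
```

The number of models is the product of partitions and bootstrap samples, which is why 20 sub-problems and 2,000 resamples yield up to 40,000 models. Each model depends only on its own sample, so the training runs are independent and can execute concurrently.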

RESULTS

We demonstrated that aggregating the models improves task performance compared with the single-model approach, which is consistent with other research studies, and that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results suggest that data partitioning may alleviate the complexity of the task. In contrast, the methods did not achieve superior scores for the tasks of site and subsite classification. Because data partitioning was based on the primary cancer site, the accuracy depended on how the partitions were determined, which needs further investigation and improvement.

CONCLUSION

The results of this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was accelerated by leveraging the high-performance Summit supercomputer at ORNL.
