Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

Affiliations

Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA; The Bredesen Center, The University of Tennessee, 821 Volunteer Blvd., Knoxville, TN 37996, USA.

Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA.

Publication information

J Biomed Inform. 2022 Jan;125:103957. doi: 10.1016/j.jbi.2021.103957. Epub 2021 Nov 22.

DOI:10.1016/j.jbi.2021.103957

PMID:34823030

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9274264/

Abstract

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
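The abstract contrasts macro F1 (which favors methods that do well on rare classes) with micro F1 (which is dominated by the most frequent classes). The following minimal Python sketch illustrates why the two metrics can diverge under class imbalance. The labels, counts, and the always-predict-the-majority "model" are hypothetical toy values chosen for illustration; they are not taken from the paper's data or methods.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Hypothetical toy data: 95 reports of a common class, 5 of a rare one.
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the common class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro F1 stays high because nearly all reports belong to the common class, while the macro F1 drops sharply because the rare class contributes a per-class F1 of zero. This is the kind of gap that makes macro F1 the more informative metric when rare classes matter.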

Similar articles

Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

J Biomed Inform. 2022 Jan;125:103957. doi: 10.1016/j.jbi.2021.103957. Epub 2021 Nov 22.

A clinical text classification paradigm using weak supervision and deep representation.

BMC Med Inform Decis Mak. 2019 Jan 7;19(1):1. doi: 10.1186/s12911-018-0723-6.

Identification of patients' smoking status using an explainable AI approach: a Danish electronic health records case study.

BMC Med Res Methodol. 2024 May 17;24(1):114. doi: 10.1186/s12874-024-02231-4.

Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing.

J Biomed Inform. 2022 Mar;127:103984. doi: 10.1016/j.jbi.2021.103984. Epub 2022 Jan 7.

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks.

J Am Med Inform Assoc. 2020 Jan 1;27(1):89-98. doi: 10.1093/jamia/ocz153.

Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.

J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.

Ensembles of natural language processing systems for portable phenotyping solutions.

J Biomed Inform. 2019 Dec;100:103318. doi: 10.1016/j.jbi.2019.103318. Epub 2019 Oct 23.

Natural language processing with deep learning for medical adverse event detection from free-text medical narratives: A case study of detecting total hip replacement dislocation.

Comput Biol Med. 2021 Feb;129:104140. doi: 10.1016/j.compbiomed.2020.104140. Epub 2020 Nov 24.

Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes.

J Am Med Inform Assoc. 2018 Jan 1;25(1):93-98. doi: 10.1093/jamia/ocx090.

A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance.

BMC Med Res Methodol. 2022 Jul 2;22(1):181. doi: 10.1186/s12874-022-01665-y.

Cited by

Time Matters: Examine Temporal Effects on Biomedical Language Models.

AMIA Annu Symp Proc. 2025 May 22;2024:723-732. eCollection 2024.

Machine-Learning Approach to Identify Organic Functional Groups from FT-IR and NMR Spectral Data.

ACS Omega. 2025 Mar 19;10(12):12717-12723. doi: 10.1021/acsomega.5c01903. eCollection 2025 Apr 1.

EnDM-CPP: A Multi-view Explainable Framework Based on Deep Learning and Machine Learning for Identifying Cell-Penetrating Peptides with Transformers and Analyzing Sequence Information.

Interdiscip Sci. 2024 Dec 23. doi: 10.1007/s12539-024-00673-4.

Transforming Cancer Classification: The Role of Advanced Gene Selection.

Diagnostics (Basel). 2024 Nov 22;14(23):2632. doi: 10.3390/diagnostics14232632.

RNA-Seq analysis for breast cancer detection: a study on paired tissue samples using hybrid optimization and deep learning techniques.

J Cancer Res Clin Oncol. 2024 Oct 10;150(10):455. doi: 10.1007/s00432-024-05968-z.

The SEER Program's evolution: supporting clinically meaningful population-level research.

J Natl Cancer Inst Monogr. 2024 Aug 1;2024(65):110-117. doi: 10.1093/jncimonographs/lgae022.

TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models.

Patterns (N Y). 2024 Feb 21;5(3):100933. doi: 10.1016/j.patter.2024.100933. eCollection 2024 Mar 8.

Topological Superconductors from a Materials Perspective.

Chem Mater. 2023 Aug 1;35(16):6184-6200. doi: 10.1021/acs.chemmater.3c00713. eCollection 2023 Aug 22.

Deep Learning for Medical Image-Based Cancer Diagnosis.

Cancers (Basel). 2023 Jul 13;15(14):3608. doi: 10.3390/cancers15143608.

Current and Emerging Informatics Initiatives Impactful to Cancer Registries.

J Registry Manag. 2022 Winter;49(4):153-160.

References

Deep Transfer Learning Across Cancer Registries for Information Extraction from Pathology Reports.

IEEE EMBS Int Conf Biomed Health Inform. 2019 May;2019. doi: 10.1109/bhi.2019.8834586. Epub 2019 Sep 12.

Deep active learning for classifying cancer pathology reports.

BMC Bioinformatics. 2021 Mar 9;22(1):113. doi: 10.1186/s12859-021-04047-1.

Limitations of Transformers on Clinical Text Classification.

IEEE J Biomed Health Inform. 2021 Sep;25(9):3596-3607. doi: 10.1109/JBHI.2021.3062322. Epub 2021 Sep 3.

Measuring Domain Shift for Deep Learning in Histopathology.

IEEE J Biomed Health Inform. 2021 Feb;25(2):325-336. doi: 10.1109/JBHI.2020.3032060. Epub 2021 Feb 5.

Classifying cancer pathology reports with hierarchical self-attention networks.

Artif Intell Med. 2019 Nov;101:101726. doi: 10.1016/j.artmed.2019.101726. Epub 2019 Oct 15.

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks.

J Am Med Inform Assoc. 2020 Jan 1;27(1):89-98. doi: 10.1093/jamia/ocz153.

Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records.

Cancer Res. 2019 Nov 1;79(21):5463-5470. doi: 10.1158/0008-5472.CAN-19-0579. Epub 2019 Aug 8.

Clinical text classification with rule-based features and knowledge-guided convolutional neural networks.

BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):71. doi: 10.1186/s12911-019-0781-4.

Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning.

IEEE Trans Pattern Anal Mach Intell. 2019 Aug;41(8):1979-1993. doi: 10.1109/TPAMI.2018.2858821. Epub 2018 Jul 23.

Classifying medical relations in clinical text via convolutional neural networks.

Artif Intell Med. 2019 Jan;93:43-49. doi: 10.1016/j.artmed.2018.05.001. Epub 2018 May 18.
