Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA; The Bredesen Center, The University of Tennessee, 821 Volunteer Blvd., Knoxville, TN 37996, USA.
Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA.
J Biomed Inform. 2022 Jan;125:103957. doi: 10.1016/j.jbi.2021.103957. Epub 2021 Nov 22.
In the last decade, the widespread adoption of electronic health record documentation has created substantial opportunities for information mining. Natural language processing (NLP) techniques based on machine learning and deep learning are increasingly used to extract information from unstructured clinical notes. Disparities in performance when machine learning models are deployed in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of CNNs for text classification on out-of-distribution (OOD) datasets that result from the natural evolution of clinical text in pathology reports. We identified class imbalance, caused by differences in the prevalence of cancer types, as one source of the performance drop, and we analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better on the most prevalent classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
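The abstract reports both macro and micro F1 scores, and the two respond differently to class imbalance. The following is a minimal sketch of that difference, not the authors' evaluation code. The function name `f1_scores`, the labels `common` and `rare`, and the 95/5 split are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro F1 pools the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages the per-class scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data (not from the paper): 95 reports of a common
# class and 5 reports of a rare class.
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the common class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the classifier never identifies the rare class, yet micro F1 stays at 0.95 while macro F1 falls to roughly 0.49. This is why a method tuned for rare classes can lead on macro F1 while a method that is stronger on prevalent classes leads on micro F1.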