Investigation on potential bias factors in histopathology datasets.

Author information

Kheiri Farnaz, Rahnamayan Shahryar, Makrehchi Masoud, Asilian Bidgoli Azam

Affiliations

Department of Electrical, Computer and Software Engineering, Ontario Tech University, Oshawa, Canada.

Department of Engineering, Brock University, St. Catharines, Canada.

Publication information

Sci Rep. 2025 Apr 2;15(1):11349. doi: 10.1038/s41598-025-89210-x.

DOI:10.1038/s41598-025-89210-x

PMID:40175463

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11965531/

Abstract

Deep neural networks (DNNs) have demonstrated remarkable capabilities in medical applications, including digital pathology, where they excel at analyzing complex patterns in medical images to assist in accurate disease diagnosis and prognosis. However, concerns have arisen about potential biases in The Cancer Genome Atlas (TCGA) dataset, a comprehensive repository of digitized histopathology data that serves as both a training and validation source for deep learning models. These concerns suggest that over-optimistic model performance may stem from reliance on biased features rather than histological characteristics. Surprisingly, recent studies have confirmed the existence of site-specific bias in the embedded features extracted for cancer-type discrimination, which leads to high accuracy in acquisition site classification. This behavior motivated us to conduct an in-depth analysis of the potential causes behind this unexpected ability to recognize site-specific patterns. The analysis was conducted on two cutting-edge DNN models: KimiaNet, a state-of-the-art DNN trained on TCGA images, and a self-trained EfficientNet. In this study, the balanced accuracy metric is used to evaluate how well a model originally designed to learn cancerous patterns performs when trained to classify data centers, with the aim of identifying the factors that contribute to its high balanced accuracy in data center detection.
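
The abstract does not define balanced accuracy, so the following is only a minimal sketch of the standard definition (the unweighted mean of per-class recall), not the authors' evaluation code. The function name `balanced_accuracy` and the site labels `"A"` and `"B"` are hypothetical and chosen for illustration. The example shows why the metric suits acquisition-site classification when sites contribute unequal numbers of images: a classifier that always predicts the largest site scores well on plain accuracy but poorly on balanced accuracy.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: site "A" has 8 images, site "B" has 2.
site_true = ["A"] * 8 + ["B"] * 2
# A degenerate classifier that always predicts the majority site.
site_pred = ["A"] * 10

plain_accuracy = sum(t == p for t, p in zip(site_true, site_pred)) / len(site_true)
balanced = balanced_accuracy(site_true, site_pred)
print(f"plain accuracy:    {plain_accuracy:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
```

Here site "A" has recall 1.0 and site "B" has recall 0.0, so the balanced accuracy is 0.5 even though 80% of the predictions are correct.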
