Differential Biases and Variabilities of Deep Learning-Based Artificial Intelligence and Human Experts in Clinical Diagnosis: Retrospective Cohort and Survey Study.

Author Information

Cha Dongchul, Pae Chongwon, Lee Se A, Na Gina, Hur Young Kyun, Lee Ho Young, Cho A Ra, Cho Young Joon, Han Sang Gil, Kim Sung Huhn, Choi Jae Young, Park Hae-Jeong

Affiliations

Department of Otorhinolaryngology, Yonsei University College of Medicine, Seoul, Republic of Korea.

Center for Systems and Translational Brain Sciences, Institute of Human Complexity and Systems Science, Yonsei University College of Medicine, Seoul, Republic of Korea.

Publication Information

JMIR Med Inform. 2021 Dec 8;9(12):e33049. doi: 10.2196/33049.

Abstract

BACKGROUND

Deep learning (DL)-based artificial intelligence may have diagnostic characteristics that differ from those of human experts in medical diagnosis. As a data-driven knowledge system, DL is thought to be more biased than clinicians by the heterogeneous disease incidence found in real-world clinical populations. Conversely, because human experts learn from a limited number of cases, they may exhibit large interindividual variability. Understanding how the two groups classify the same data differently is therefore an essential step toward the cooperative use of DL in clinical applications.

OBJECTIVE

This study aimed to evaluate and compare the differential effects of clinical experience on otoendoscopic image diagnosis in computers and physicians, using the class imbalance problem as an example, and to guide clinicians in using decision support systems.

METHODS

We used digital otoendoscopic images of patients who visited the outpatient clinic of the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019, totaling 22,707 otoendoscopic images. After excluding similar images, 7500 otoendoscopic images were selected for labeling. We built a DL-based image classification model to assign each image to 1 of 6 disease categories. Two test sets of 300 images each were constructed: a balanced test set and an imbalanced test set. We included 14 clinicians (otolaryngologists and nonotolaryngology physicians, including general practitioners) and 13 DL-based models. We used accuracy (overall and per class) and kappa statistics to compare the results of individual physicians and the machine learning (ML) models.
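As a rough sketch of the kind of pipeline the Methods describe (the paper does not include its code, so the backbone, preprocessing, and hyperparameters below are assumptions), a 6-class classifier can be built by fine-tuning a pretrained convolutional network in PyTorch:

```python
# Minimal sketch of a 6-class otoendoscopic image classifier.
# The ResNet backbone, ImageNet preprocessing, and learning rate are
# illustrative assumptions, not the study's actual configuration.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 6  # six disease categories, as in the study

# Standard ImageNet-style preprocessing (assumed).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Transfer learning: swap the final layer of a pretrained ResNet-50
# so the network outputs one logit per disease category.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```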

RESULTS

Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), comparable to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). However, the ML models were susceptible to the class imbalance problem, scoring higher on the imbalanced test set than on the balanced one, which indicates a bias toward prevalent classes. Data augmentation mitigated this, particularly for low-incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, although less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07).
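For reference, the accuracy and agreement measures reported above can be computed with scikit-learn; the label arrays below are hypothetical stand-ins for the 300-image test sets, and a pairwise Cohen's kappa is shown for simplicity (the exact kappa variant the study used for groups of raters is not specified in the abstract):

```python
# Hypothetical sketch: computing overall accuracy, per-class accuracy,
# and kappa agreement between two raters. All labels are stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

y_true = np.array([0, 1, 2, 3, 4, 5, 0, 1])   # ground-truth labels (stand-in)
rater_a = np.array([0, 1, 2, 3, 4, 5, 0, 2])  # e.g., an ML model's predictions
rater_b = np.array([0, 1, 2, 2, 4, 5, 1, 1])  # e.g., a physician's answers

overall_acc = accuracy_score(y_true, rater_a)

# Per-class accuracy (recall per class): diagonal of the confusion
# matrix divided by the number of true samples in each class.
cm = confusion_matrix(y_true, rater_a, labels=list(range(6)))
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Cohen's kappa measures agreement between two raters beyond chance;
# high kappa within a group indicates low inter-rater variability.
kappa = cohen_kappa_score(rater_a, rater_b)
print(overall_acc, per_class_acc, kappa)
```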

CONCLUSIONS

Although ML models deliver excellent performance in classifying ear disease, physicians and ML models each have their own strengths. ML models achieve consistently high accuracy but consider only the given image and are biased toward prevalent diseases, whereas human physicians vary in performance, are not biased toward prevalence, and may also draw on information beyond the image. To deliver the best patient care amid a shortage of otolaryngologists, our ML model can play a cooperative role for clinicians with diverse expertise, provided users keep in mind that the models consider only images and can remain biased toward prevalent diseases even after data augmentation.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df6f/8701703/4de42b3679ad/medinform_v9i12e33049_fig1.jpg
