Differential Biases and Variabilities of Deep Learning-Based Artificial Intelligence and Human Experts in Clinical Diagnosis: Retrospective Cohort and Survey Study.

Author Information

Cha Dongchul, Pae Chongwon, Lee Se A, Na Gina, Hur Young Kyun, Lee Ho Young, Cho A Ra, Cho Young Joon, Han Sang Gil, Kim Sung Huhn, Choi Jae Young, Park Hae-Jeong

Affiliations

Department of Otorhinolaryngology, Yonsei University College of Medicine, Seoul, Republic of Korea.

Center for Systems and Translational Brain Sciences, Institute of Human Complexity and Systems Science, Yonsei University College of Medicine, Seoul, Republic of Korea.

Publication Information

JMIR Med Inform. 2021 Dec 8;9(12):e33049. doi: 10.2196/33049.

Abstract

BACKGROUND

Deep learning (DL)-based artificial intelligence may have diagnostic characteristics that differ from those of human experts in medical diagnosis. As a data-driven knowledge system, DL is thought to be more biased than clinicians by the heterogeneous disease incidence found in real-world clinical populations. Conversely, because human experts learn from a limited number of cases, they may exhibit large interindividual variability. Understanding how the two groups classify the same data differently is therefore an essential step toward the cooperative use of DL in clinical applications.

OBJECTIVE

This study aimed to evaluate and compare the differential effects of clinical experience on otoendoscopic image diagnosis in computers and physicians, using the class imbalance problem as an example, and to guide clinicians in using decision support systems.

METHODS

We used digital otoendoscopic images of patients who visited the outpatient clinic of the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019, totaling 22,707 otoendoscopic images. After excluding similar images, 7500 otoendoscopic images were selected for labeling. We built a DL-based image classification model to assign each image to 1 of 6 disease categories. Two test sets of 300 images each were constructed: a balanced test set and an imbalanced test set. We included 14 clinicians (otolaryngologists and nonotolaryngology physicians, including general practitioners) and 13 DL-based models. We used accuracy (overall and per class) and kappa statistics to compare the results of individual physicians and the machine learning (ML) models.
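As a rough sketch of the kind of pipeline the Methods describe (the paper does not include its code, so the backbone, preprocessing, and hyperparameters below are assumptions), a 6-class classifier can be built by fine-tuning a pretrained convolutional network in PyTorch:

```python
# Minimal sketch of a 6-class otoendoscopic image classifier.
# The ResNet backbone, ImageNet preprocessing, and learning rate are
# illustrative assumptions, not the study's actual configuration.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 6  # six disease categories, as in the study

# Standard ImageNet-style preprocessing (assumed).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Transfer learning: swap the final layer of a pretrained ResNet-50
# so the network outputs one logit per disease category.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```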

RESULTS

Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), comparable to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). However, the ML models were susceptible to the class imbalance problem, scoring higher on the imbalanced test set than on the balanced one, which indicates a bias toward prevalent classes. Data augmentation mitigated this, particularly for low-incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, although less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07).
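For reference, the accuracy and agreement measures reported above can be computed with scikit-learn; the label arrays below are hypothetical stand-ins for the 300-image test sets, and a pairwise Cohen's kappa is shown for simplicity (the exact kappa variant the study used for groups of raters is not specified in the abstract):

```python
# Hypothetical sketch: computing overall accuracy, per-class accuracy,
# and kappa agreement between two raters. All labels are stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

y_true = np.array([0, 1, 2, 3, 4, 5, 0, 1])   # ground-truth labels (stand-in)
rater_a = np.array([0, 1, 2, 3, 4, 5, 0, 2])  # e.g., an ML model's predictions
rater_b = np.array([0, 1, 2, 2, 4, 5, 1, 1])  # e.g., a physician's answers

overall_acc = accuracy_score(y_true, rater_a)

# Per-class accuracy (recall per class): diagonal of the confusion
# matrix divided by the number of true samples in each class.
cm = confusion_matrix(y_true, rater_a, labels=list(range(6)))
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Cohen's kappa measures agreement between two raters beyond chance;
# high kappa within a group indicates low inter-rater variability.
kappa = cohen_kappa_score(rater_a, rater_b)
print(overall_acc, per_class_acc, kappa)
```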

CONCLUSIONS

Although ML models deliver excellent performance in classifying ear disease, physicians and ML models each have their own strengths. ML models achieve consistently high accuracy but consider only the given image and are biased toward prevalent diseases, whereas human physicians vary in performance, are not biased toward prevalence, and may also draw on information beyond the image. To deliver the best patient care amid a shortage of otolaryngologists, our ML model can play a cooperative role for clinicians with diverse expertise, provided users keep in mind that the models consider only images and can remain biased toward prevalent diseases even after data augmentation.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df6f/8701703/4de42b3679ad/medinform_v9i12e33049_fig1.jpg
