Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Shimotsuga-gun, Tochigi, 321-0293, Japan, 81 282-86-1111, 81 282-86-4775.
JMIR Form Res. 2024 Nov 27;8:e58666. doi: 10.2196/58666.
BACKGROUND: Diagnostic errors are a significant problem in medical care. Although artificial intelligence (AI)-based diagnostic decision support systems are useful, physicians' overreliance on AI-generated diagnoses may lead to diagnostic errors.
OBJECTIVE: We investigated whether trust calibration, that is, adjusting physicians' trust levels to match the actual reliability of AI, supports the safe use of AI-based diagnostic decision support systems.
METHODS: A quasi-experimental study was conducted at Dokkyo Medical University, Japan, with physicians allocated (1:1) to the intervention and control groups. A total of 20 clinical cases were created based on the medical histories recorded by an AI-driven automated medical history-taking system from actual patients who visited a community-based hospital in Japan. For each case, the participants reviewed the medical history together with an AI-generated list of 10 differential diagnoses and provided 1 to 3 possible diagnoses. In the intervention group, physicians were also asked whether the final diagnosis was in the AI-generated list of 10 differential diagnoses; this question served as the trust calibration. We analyzed the diagnostic accuracy of physicians and the correctness of the trust calibration in the intervention group. We also investigated the relationship between the accuracy of the trust calibration and the diagnostic accuracy of physicians, and the physicians' confidence level regarding the use of AI.
RESULTS: Among the 20 physicians assigned to the intervention (n=10) and control (n=10) groups, the mean age was 30.9 (SD 3.9) years and 31.7 (SD 4.2) years, the proportion of men was 80% and 60%, and the mean postgraduate year was 5.8 (SD 2.9) and 7.2 (SD 4.6), respectively, with no significant differences. The physicians' diagnostic accuracy was 41.5% in the intervention group and 46% in the control group, with no significant difference (95% CI -0.75 to 2.55; P=.27). The overall accuracy of the trust calibration was only 61.5%, and even when the calibration was correct, the diagnostic accuracy was 54.5%. In the multivariate logistic regression model, the accuracy of the trust calibration was a significant contributor to the diagnostic accuracy of physicians (adjusted odds ratio 5.90, 95% CI 2.93-12.46; P<.001). The mean confidence level for AI was 72.5% in the intervention group and 45% in the control group, with no significant difference.
CONCLUSIONS: Trust calibration did not significantly improve physicians' diagnostic accuracy when they formed differential diagnoses from the medical histories and the differential diagnosis lists produced by an AI-driven automated medical history-taking system. As this was a formative study, the small sample size and suboptimal trust calibration methods may have contributed to the lack of significant differences. This study highlights the need for a larger sample size and for supportive measures that help physicians calibrate their trust.
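The percentages in RESULTS can be turned back into approximate response counts as a rough sanity check. The sketch below is not part of the study's analysis. It assumes that every physician answered all 20 cases (200 responses per group) and that the 54.5% figure refers to intervention-group responses with a correct calibration; neither is stated explicitly in the abstract. The crude odds ratio it derives is unadjusted, so it is not expected to equal the reported adjusted odds ratio of 5.90. All names in the code are hypothetical.

```python
# Back-of-envelope reconstruction of counts from the reported percentages.
# ASSUMPTION: each of the 10 physicians per group answered all 20 cases,
# giving 200 responses per group (not stated explicitly in the abstract).
PHYSICIANS_PER_GROUP = 10
CASES = 20
RESPONSES = PHYSICIANS_PER_GROUP * CASES  # 200


def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of responses matching a reported percentage."""
    return round(percent / 100 * total)


# Correct diagnoses per group (41.5% and 46% of 200 responses).
correct_intervention = implied_count(41.5, RESPONSES)
correct_control = implied_count(46.0, RESPONSES)

# Intervention group: responses where the trust calibration was correct.
calibrated = implied_count(61.5, RESPONSES)
miscalibrated = RESPONSES - calibrated

# ASSUMPTION: 54.5% is diagnostic accuracy among correctly calibrated responses.
correct_given_calibrated = implied_count(54.5, calibrated)
correct_given_miscalibrated = correct_intervention - correct_given_calibrated

# Crude (unadjusted) odds ratio from the implied 2x2 table.
odds_calibrated = correct_given_calibrated / (calibrated - correct_given_calibrated)
odds_miscalibrated = correct_given_miscalibrated / (
    miscalibrated - correct_given_miscalibrated
)
crude_odds_ratio = odds_calibrated / odds_miscalibrated

print(correct_intervention, correct_control, calibrated, correct_given_calibrated)
print(round(crude_odds_ratio, 2))
```

Under these assumptions, the implied crude odds ratio points in the same direction as the adjusted estimate from the multivariate model, which additionally accounts for covariates that the abstract does not list.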