

Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4.

Author information

D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA.

Publication information

Diagnosis (Berl). 2024 May 7;11(3):250-258. doi: 10.1515/dx-2024-0033. eCollection 2024 Aug 1.

Abstract

OBJECTIVES

To validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing the diagnosis lists produced by ChatGPT4 with those produced by Isabel Pro.

METHODS

This study compared ChatGPT4 with Isabel Pro on 201 cases, with identical inputs to both systems. Mean Reciprocal Rank (MRR) was used to compare the rank of the correct diagnosis between systems. Isabel Pro ranks diagnoses by the frequency with which the presenting symptoms appear in its reference dataset; the mechanism ChatGPT4 uses to rank its diagnoses is unknown. A Wilcoxon signed-rank test failed to reject the null hypothesis of no difference between the systems.
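For reference, MRR is the mean over cases of the reciprocal of the rank at which the correct diagnosis appears (conventionally 0 when it does not appear at all), so 1/MRR can be read as a harmonic-mean rank. The sketch below illustrates the computation and one plausible pairing for the Wilcoxon test; the rank lists are hypothetical stand-ins, since the study's per-case data and exact test inputs are not given in the abstract.

```python
from scipy.stats import wilcoxon

def mean_reciprocal_rank(ranks):
    """MRR over cases; each rank is the 1-based position of the correct
    diagnosis in a system's list, or None if it never appears."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# Hypothetical per-case ranks for two systems (not the study data).
isabel_ranks = [1, 3, 2, None, 5]
gpt4_ranks = [2, 1, 4, 6, None]

print(mean_reciprocal_rank(isabel_ranks))  # -> 0.407
print(mean_reciprocal_rank(gpt4_ranks))    # -> 0.383

# Paired Wilcoxon signed-rank test on the per-case reciprocal ranks
# (absent -> 0), testing the null hypothesis of no median difference
# between the two systems. This pairing is an assumption for
# illustration; the paper does not state the exact paired values used.
rr_a = [0.0 if r is None else 1.0 / r for r in isabel_ranks]
rr_b = [0.0 if r is None else 1.0 / r for r in gpt4_ranks]
statistic, p_value = wilcoxon(rr_a, rr_b)
print(statistic, p_value)
```

Note that 1/MRR is only one way to recover an "average rank" from an MRR value; the mean ranks quoted in the results below may have been computed directly from the per-case ranks.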

RESULTS

Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro returned the correct diagnosis in 175 cases (87.1%) and ChatGPT4 in 165 (82.1%). The MRR was 0.428 for ChatGPT4 (mean rank 2.31) and 0.389 for Isabel Pro (mean rank 2.57), an average rank of approximately three for each system. ChatGPT4 outperformed on recall at ranks 1, 5, and 10, with Isabel Pro outperforming at ranks 20, 30, and 40. The Wilcoxon signed-rank test indicated that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9%) but only 52 correct DOIs (31.5%).
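Recall at rank k, as used above, is simply the fraction of cases whose correct diagnosis appears within the top k entries of a system's list. A minimal sketch, again with hypothetical ranks rather than the study data:

```python
def recall_at_k(ranks, k):
    """Fraction of cases whose correct diagnosis appears at rank <= k.

    ranks: 1-based position of the correct diagnosis per case,
    or None when it is absent from the list entirely.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

# Hypothetical ranks (not the study data).
ranks = [1, 3, 2, None, 12, 25, 7, None, 4, 33]
for k in (1, 5, 10, 20, 30, 40):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.2f}")
```

Because recall at rank k is cumulative in k, one system can lead at small k (more correct diagnoses near the top of the list) while the other leads at large k (more correct diagnoses captured somewhere in a longer list), which is the pattern reported here.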

CONCLUSIONS

This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the hallucination of references, and perhaps of diagnoses, is resolved in favor of absolute accuracy, clinicians will use large language model systems in diagnosis cautiously, if at all.

