

Computerized diagnostic decision support systems-Isabel Pro versus ChatGPT-4 part II.

Author Information

Bridges Joe M, Jiang Xiaoqian, Ige Michael, Toyobo Oluwatoniloba

Affiliation

D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Publication Information

JAMIA Open. 2025 Jun 16;8(3):ooaf048. doi: 10.1093/jamiaopen/ooaf048. eCollection 2025 Jun.

Abstract

OBJECTIVE

To determine whether a Tree-of-Thought prompt and reconsideration of Isabel Pro's differential improve ChatGPT-4's diagnostic accuracy; whether increasing the expert panel size improves ChatGPT-4's accuracy; whether ChatGPT-4 produces consistent outputs across sequential requests; and how frequently ChatGPT-4 fabricates references.

MATERIALS AND METHODS

The study compared Isabel Pro, a computerized diagnostic decision support system, with ChatGPT-4, a large language model. Using 201 cases from the New England Journal of Medicine, each system produced a differential diagnosis ranked by likelihood. The evaluation metrics were Mean Reciprocal Rank (MRR), Recall at Rank, Average Rank, Number of Correct Diagnoses, and Rank Improvement. For reproducibility, the study compared the initial expert panel run with each subsequent run, using the r-squared value from a scatter plot of each run against the initial run.
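For readers unfamiliar with these ranking metrics, here is a minimal Python sketch of how Mean Reciprocal Rank, Recall at Rank k, and Average Rank are conventionally computed from the rank at which the correct diagnosis appears in each differential. It assumes 1-based ranks and represents a missed diagnosis as None; the example values are hypothetical, not the study's data.

def mean_reciprocal_rank(ranks):
    # MRR: average of 1/rank, counting a missed diagnosis as 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at_k(ranks, k=10):
    # Fraction of cases whose correct diagnosis appears within the top k.
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def average_rank(ranks):
    # Mean rank over the cases where the correct diagnosis was retrieved at all.
    found = [r for r in ranks if r is not None]
    return sum(found) / len(found)

# Five hypothetical cases; None means the diagnosis never appeared.
ranks = [1, 3, None, 10, 2]
print(mean_reciprocal_rank(ranks))  # ~0.39
print(recall_at_k(ranks, k=10))     # 0.8
print(average_rank(ranks))          # 4.0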

RESULTS

ChatGPT-4 improved MRR and Recall at 10 to 0.72 but produced fewer correct diagnoses and a worse average rank. Reconsideration of the Isabel Pro differential improved Recall at 10 by 11%. An expert panel size of two produced the best result. The reproducibility runs were within 4% of one another on average for Recall at 10, but the scatter plots showed r-squared values ranging from 0.34 to 0.44, suggesting poor reproducibility. Reference accuracy was 34.8% for citations and 37.8% for DOIs.
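The r-squared comparison described above can be illustrated with a short sketch: compute the squared Pearson correlation between the correct-diagnosis ranks of the initial expert panel run and a repeat run. The paired ranks below are hypothetical, and numpy's corrcoef stands in for whatever fitting routine the authors applied to their scatter plots.

import numpy as np

# Hypothetical ranks of the correct diagnosis for the same six cases
# in the initial run and in a repeat run (illustrative values only).
run_initial = np.array([1, 3, 5, 10, 2, 7])
run_repeat = np.array([2, 8, 4, 1, 3, 10])

# r-squared of the repeat-run ranks against the initial-run ranks;
# values well below 1 indicate that rankings shift between runs.
r = np.corrcoef(run_initial, run_repeat)[0, 1]
print(f"r-squared = {r**2:.2f}")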

DISCUSSION

ChatGPT-4 performs well with images and electrocardiography and in administrative practice management, but its diagnostic performance has not proven as promising.

CONCLUSIONS

As noted above, the results raise concerns about diagnostic accuracy, reproducibility, and reference citation accuracy. Until these issues are resolved, clinical use for diagnosis will be minimal, if it occurs at all.
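Given the low reference and DOI accuracy reported above, one coarse screening step, sketched here as an assumption rather than the study's actual verification procedure, is to test whether each cited DOI resolves at doi.org. A resolving DOI can still point to the wrong paper, so this only catches outright fabricated identifiers.

import requests

def doi_resolves(doi: str) -> bool:
    # A DOI that resolves at doi.org is at least a real identifier.
    # Some publishers reject HEAD requests, so a GET fallback may be
    # needed in practice.
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    return resp.status_code == 200

# Example using this article's own DOI.
print(doi_resolves("10.1093/jamiaopen/ooaf048"))  # expected: True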


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b9a/12169417/49e87c59b677/ooaf048f1.jpg
