

Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings.

Author Affiliations

From the Department of Radiological Sciences, UCI Medical Center, University of California, Irvine, 101 The City Dr S, Orange, CA 92868 (S.H.S., K.H., G.C., R. Hill, J.T., R. Houshyar, V.Y., M.T.); Anaheim, Calif (L.Y.); and Division of Research, Kaiser Permanente of Northern California, Pleasanton, Calif (A.L.N.).

Publication Information

Radiology. 2024 Oct;313(1):e232346. doi: 10.1148/radiol.232346.

Abstract

Background The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations.

Purpose To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.

Materials and Methods Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top 3 differential diagnoses provided in the textbook, which served as the ground truth. Reliability, defined based on the frequency of algorithmic hallucination, was assessed through the identification of factually incorrect statements and fabricated references. Comparisons were made between the algorithms using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses from both algorithms for 10 cases in each subspecialty and calculating the average pairwise percent agreement and Krippendorff α.

Results A total of 339 cases were collected across multiple radiologic subspecialties. The overall accuracy of GPT-3.5 and GPT-4 for final diagnosis was 53.7% (182 of 339) and 66.1% (224 of 339; P < .001), respectively. The mean differential score (ie, proportion of top 3 diagnoses that matched the original literature differential diagnosis) for GPT-3.5 and GPT-4 was 0.50 and 0.54 (P = .06), respectively. Of the references provided in GPT-3.5 and GPT-4 responses, 39.9% (401 of 1006) and 14.3% (161 of 1124; P < .001), respectively, were fabricated. GPT-3.5 and GPT-4 generated false statements in 16.2% (55 of 339) and 4.7% (16 of 339; P < .001) of cases, respectively. The range of average pairwise percent agreement across subspecialties for the final diagnosis and top 3 differential diagnoses was 59%-98% and 23%-49%, respectively.

Conclusion ChatGPT achieved the best results when the most up-to-date model (GPT-4) was used and when it was prompted for a single diagnosis. Hallucination frequency was lower with GPT-4 than with GPT-3.5, but repeatability was an issue for both models.

© RSNA, 2024. See also the editorial by Chang in this issue.
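For readers less familiar with the metrics named in the abstract, the following Python sketch (not the study's code; all diagnoses and responses below are hypothetical placeholders) illustrates how a per-case differential score and the average pairwise percent agreement used for test-retest repeatability could be computed under a simple exact-match assumption.

```python
# Minimal sketch, assuming exact string matching between diagnoses.
# This is illustrative only and does not reproduce the authors' analysis.

from itertools import combinations


def differential_score(model_top3, reference_differential):
    """Proportion of the model's top 3 diagnoses found in the reference differential."""
    matches = sum(1 for dx in model_top3 if dx in reference_differential)
    return matches / len(model_top3)


def average_pairwise_percent_agreement(repeated_answers):
    """Mean agreement over all unordered pairs of repeated responses for one case."""
    pairs = list(combinations(repeated_answers, 2))
    agreements = [a == b for a, b in pairs]
    return 100.0 * sum(agreements) / len(pairs)


if __name__ == "__main__":
    # Hypothetical case: textbook differential vs. a model's top 3 diagnoses.
    reference = {"pancreatic adenocarcinoma", "autoimmune pancreatitis", "lymphoma"}
    model_top3 = ["pancreatic adenocarcinoma", "lymphoma", "metastasis"]
    print(f"Differential score: {differential_score(model_top3, reference):.2f}")  # 0.67

    # Hypothetical repeatability check: 10 repeated final-diagnosis responses for one case.
    repeats = ["dx_A"] * 7 + ["dx_B"] * 3
    print(f"Pairwise percent agreement: {average_pairwise_percent_agreement(repeats):.1f}%")
```

In the study, per-case values like these would be aggregated across cases and subspecialties, and the accuracy comparison between GPT-3.5 and GPT-4 would additionally use the McNemar test and a generalized estimating equation framework, which the sketch does not attempt to reproduce.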

