
Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology's "Diagnosis Please" cases.

Affiliations

Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.

Department of Radiology, Ichinomiyanishi Hospital, 1 Hira Kaimei, Ichinomiya-shi, Aichi, 494-0001, Japan.

Publication information

Jpn J Radiol. 2024 Dec;42(12):1399-1402. doi: 10.1007/s11604-024-01634-z. Epub 2024 Aug 3.

Abstract

PURPOSE

The diagnostic performance of large language model-based artificial intelligence (AI) systems when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performance on Radiology's "Diagnosis Please" quiz questions.

MATERIALS AND METHODS

In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology's "Diagnosis Please" cases (cases 1 to 322, published from 1998 to 2023). The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG files). We applied McNemar's test to evaluate differences in overall correct response rates between Conditions 1, 2, and 3 for each model and between the models.
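The paired comparison described above can be sketched as follows. McNemar's test compares two conditions evaluated on the same cases and depends only on the discordant pairs (cases correct under one condition but not the other). This is a minimal illustration with hypothetical counts, not the study's actual contingency table; the function name and the example numbers are assumptions for demonstration.

```python
# Exact McNemar test on paired correct/incorrect outcomes for the same
# set of cases under two input conditions (e.g., history alone vs.
# history plus key images). Pure standard library.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts:
    b = correct under condition A only, c = correct under condition B only.
    Under H0, each discordant pair is equally likely to fall either way."""
    n = b + c
    k = min(b, c)
    # Exact binomial tail probability, doubled for a two-sided test.
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)

# Hypothetical example: of 322 cases, 8 were correct only with history
# alone and 30 only when the key image was added; the concordant cases
# do not enter the test.
p = mcnemar_exact(8, 30)
print(f"p = {p:.4g}")  # well below 0.05, so the two conditions differ
```

With balanced discordant counts (e.g., 10 vs. 10) the p-value is 1.0, reflecting no evidence of a difference between conditions.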

RESULTS

The correct diagnosis rates for Claude 3 Opus and Claude 3.5 Sonnet were 58/322 (18.0%) and 69/322 (21.4%) under Condition 1, 201/322 (62.4%) and 209/322 (64.9%) under Condition 2, and 80/322 (24.8%) and 97/322 (30.1%) under Condition 3, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) of cases for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences in correct response rates were observed among all pairwise combinations of Conditions 1, 2, and 3 for each model (p < 0.01). Claude 3.5 Sonnet outperformed Claude 3 Opus under all conditions, but a statistically significant difference between the models was observed only under Condition 3 (30.1% vs. 24.8%, p = 0.028).

CONCLUSION

Both AI models demonstrated significantly improved diagnostic performance when key images were input together with the clinical history. The models' ability to identify important differential diagnoses under these conditions was also confirmed.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc1f/11588754/207e5ed03c3a/11604_2024_1634_Fig1_HTML.jpg
