Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models.

Author information

From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.).

Publication information

Radiology. 2024 Mar;310(3):e232255. doi: 10.1148/radiol.232255.

DOI: 10.1148/radiol.232255
PMID: 38470237

Abstract

Background: Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with medical diagnosis.

Purpose: To investigate the viability of leveraging three publicly available LLMs to enhance consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard.

Materials and Methods: US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs: OpenAI's ChatGPT 3.5, ChatGPT 4.0, and Google's Bard. Inter- and intra-LLM agreement of diagnosis were evaluated. Then, diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was evaluated and compared for the LLMs and three interactive approaches: human reader combined with LLMs, image-to-text model combined with LLMs, and an end-to-end convolutional neural network model.

Results: A total of 1161 US images of thyroid nodules (498 benign, 663 malignant) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were evaluated. ChatGPT 4.0 and Bard displayed substantial to almost perfect intra-LLM agreement (κ range, 0.65-0.86 [95% CI: 0.64, 0.86]), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%-86% (95% CI: 76%, 88%) and sensitivity of 86%-95% (95% CI: 83%, 96%), compared with 74%-86% (95% CI: 71%, 88%) and 74%-91% (95% CI: 71%, 93%), respectively, for Bard. Moreover, with ChatGPT 4.0, the image-to-text-LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human-LLM interaction strategy with two senior readers and one junior reader and exceeding those of the human-LLM interaction strategy with one junior reader.

Conclusion: LLMs, particularly integrated with image-to-text approaches, show potential in enhancing diagnostic medical imaging. ChatGPT 4.0 was optimal for consistency and diagnostic accuracy when compared with Bard and ChatGPT 3.5.

© RSNA, 2024
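For readers unfamiliar with the metrics reported in the abstract, the following is a minimal Python sketch of how accuracy, sensitivity, specificity, and Cohen's κ can be computed for binary malignancy calls. It is not the authors' code. The function names and the toy labels are hypothetical, and the mapping from κ to verbal labels assumes the commonly used Landis and Koch bands, which the abstract's wording ("fair", "substantial", "almost perfect") appears to follow.

```python
from collections import Counter

def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity for binary labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of malignant nodules detected
        "specificity": tn / (tn + fp),  # share of benign nodules cleared
    }

def cohen_kappa(a, b):
    """Cohen's kappa between two sets of ratings of the same cases."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    # Undefined (division by zero) if both raters always give one identical label.
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for kappa (assumed Landis and Koch bands)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical toy data, not taken from the study.
pathology = [1, 1, 1, 1, 0, 0, 0, 0]   # reference standard
run_1     = [1, 1, 1, 0, 0, 0, 1, 0]   # first LLM response per image
run_2     = [1, 1, 1, 1, 0, 0, 1, 0]   # repeated LLM response per image

metrics = diagnostic_metrics(pathology, run_1)
kappa = cohen_kappa(run_1, run_2)      # intra-LLM agreement across repeats
print(metrics, round(kappa, 2), landis_koch(kappa))
```

Comparing two repeated runs of the same model, as above, corresponds to intra-LLM agreement; comparing runs from two different models would correspond to inter-LLM agreement. Under the assumed bands, the κ endpoints quoted in the abstract (0.36, 0.65, 0.68, 0.86) map to "fair", "substantial", "substantial", and "almost perfect" respectively.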

Similar articles

1. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models.
   Radiology. 2024 Mar;310(3):e232255. doi: 10.1148/radiol.232255.
2. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
   Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.
3. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.
   Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.
4. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
   J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
5. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
   JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
6. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard.
   EBioMedicine. 2023 Sep;95:104770. doi: 10.1016/j.ebiom.2023.104770. Epub 2023 Aug 23.
7. Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.
   Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.
8. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
   Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
9. Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard.
   J Med Internet Res. 2024 May 17;26:e54758. doi: 10.2196/54758.
10. Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions.
    Surg Obes Relat Dis. 2024 Jul;20(7):609-613. doi: 10.1016/j.soard.2024.04.014. Epub 2024 May 8.

Cited by

1. A Multimodal Large Language Model as an End-to-End Classifier of Thyroid Nodule Malignancy Risk: Usability Study.
   JMIR Form Res. 2025 Aug 19;9:e70863. doi: 10.2196/70863.
2. ChatGPT-4 Vision: a promising tool for diagnosing thyroid nodules.
   Front Med (Lausanne). 2025 Jul 30;12:1634976. doi: 10.3389/fmed.2025.1634976. eCollection 2025.
3. Exploring GPT-4o's multimodal reasoning capabilities with panoramic radiograph: the role of prompt engineering.
   Clin Oral Investig. 2025 Aug 12;29(9):405. doi: 10.1007/s00784-025-06498-9.
4. A narrative review on innovations of thyroid nodule ultrasound diagnosis: applications of robot and artificial intelligence technology.
   Gland Surg. 2025 Jul 31;14(7):1379-1389. doi: 10.21037/gs-2025-75. Epub 2025 Jul 28.
5. Machine learning approaches for EGFR mutation status prediction in NSCLC: an updated systematic review.
   Front Oncol. 2025 Jul 10;15:1576461. doi: 10.3389/fonc.2025.1576461. eCollection 2025.
6. Ultrasound radiomics models improve preoperative diagnosis and reduce unnecessary biopsies in indeterminate thyroid nodules.
   Front Endocrinol (Lausanne). 2025 Jul 10;16:1615304. doi: 10.3389/fendo.2025.1615304. eCollection 2025.
7. Large language model integrations in cancer decision-making: a systematic review and meta-analysis.
   NPJ Digit Med. 2025 Jul 17;8(1):450. doi: 10.1038/s41746-025-01824-7.
8. Can AI-Based ChatGPT Models Accurately Analyze Hand-Wrist Radiographs? A Comparative Study.
   Diagnostics (Basel). 2025 Jun 14;15(12):1513. doi: 10.3390/diagnostics15121513.
9. Multimodal Deep Learning Based on Ultrasound Images and Clinical Data for Better Ovarian Cancer Diagnosis.
   J Imaging Inform Med. 2025 Jun 24. doi: 10.1007/s10278-025-01566-8.
10. Using a Large Language Model for Breast Imaging Reporting and Data System Classification and Malignancy Prediction to Enhance Breast Ultrasound Diagnosis: Retrospective Study.
    JMIR Med Inform. 2025 Jun 11;13:e70924. doi: 10.2196/70924.