Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach.

Authors

Wang Zhixiang, Zhang Zhen, Traverso Alberto, Dekker Andre, Qian Linxue, Sun Pengfei

Affiliations

Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China.

Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands.

Publication information

Quant Imaging Med Surg. 2024 Feb 1;14(2):1602-1615. doi: 10.21037/qims-23-1180. Epub 2024 Jan 11.

DOI:10.21037/qims-23-1180

PMID:38415150

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10895085/

Abstract

BACKGROUND

As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans.

METHODS

Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model.
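
One simple way to quantify consistency is to query the model several times per case and measure how often the repeated outputs agree. The sketch below illustrates that idea only; the abstract does not describe the study's actual procedure, and the function name `agreement_rate`, the label strings, and the sample data are all hypothetical.

```python
from collections import Counter


def agreement_rate(runs_per_case):
    """Summarize agreement across repeated model outputs (hypothetical helper).

    runs_per_case: list of lists; each inner list holds the diagnosis labels
    returned for one case over repeated queries.
    Returns (share of cases where every run agrees,
             mean share of runs matching the most common label).
    """
    if not runs_per_case:
        raise ValueError("need at least one case")
    fully_consistent = 0
    modal_shares = []
    for labels in runs_per_case:
        if not labels:
            raise ValueError("each case needs at least one run")
        # Count of the most frequent label among the repeated runs.
        top_count = Counter(labels).most_common(1)[0][1]
        modal_shares.append(top_count / len(labels))
        if top_count == len(labels):
            fully_consistent += 1
    return (fully_consistent / len(runs_per_case),
            sum(modal_shares) / len(modal_shares))


# Made-up labels for two cases, three runs each (not data from the study).
example = [
    ["malignant", "malignant", "malignant"],
    ["benign", "malignant", "benign"],
]
full_share, mean_modal_share = agreement_rate(example)
print(full_share, round(mean_modal_share, 3))
```

Reporting both numbers separates cases with any disagreement at all from the typical degree of disagreement within a case.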

RESULTS

GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process.

CONCLUSIONS

GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fee6/10895085/c729654cbd2b/qims-14-02-1602-f1.jpg
