

Comparing ChatGPT 4.0's Performance in Interpreting Thyroid Nodule Ultrasound Reports Using ACR-TI-RADS 2017: Analysis Across Different Levels of Ultrasound User Experience.

Author Information

Wakonig Katharina Margherita, Barisch Simon, Kozarzewski Leonard, Dommerich Steffen, Lerchbaumer Markus Herbert

Affiliations

Department of Otorhinolaryngology, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Campus Virchow Klinikum and Campus Charité Mitte, Charitéplatz 1, 10117 Berlin, Germany.

Department of Endocrinology, Diabetes and Metabolism, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Charitéplatz 1, 10117 Berlin, Germany.

Publication Information

Diagnostics (Basel). 2025 Mar 6;15(5):635. doi: 10.3390/diagnostics15050635.

Abstract

This study evaluates ChatGPT 4.0's ability to interpret thyroid ultrasound (US) reports using ACR-TI-RADS 2017 criteria, comparing its performance with that of US users at different experience levels. A team of medical experts, an inexperienced US user, and ChatGPT 4.0 analyzed 100 fictitious thyroid US reports. ChatGPT's performance was assessed for accuracy, consistency, and diagnostic recommendations, including fine-needle aspiration (FNA) and follow-ups. ChatGPT demonstrated substantial agreement with experts in assessing echogenic foci, but inconsistencies in other criteria, such as composition and margins, were evident in both of its analysis runs. Interrater reliability between ChatGPT and experts ranged from moderate to almost perfect, reflecting AI's potential but also its limitations in achieving expert-level interpretations. The inexperienced US user outperformed ChatGPT, reaching nearly perfect agreement with the experts, which highlights the critical role of traditional medical training in applying standardized risk stratification tools such as TI-RADS. ChatGPT showed high specificity in recommending FNAs but lower sensitivity and specificity for follow-ups compared to the medical student. These findings emphasize ChatGPT's potential as a supportive diagnostic tool rather than a replacement for human expertise. Enhancing AI algorithms and training could improve ChatGPT's clinical utility, enabling better support for clinicians in managing thyroid nodules and improving patient care. This study highlights both the promise and current limitations of AI in medical diagnostics, advocating for its refinement and integration into clinical workflows. However, it emphasizes that traditional clinical training must not be compromised, as it is essential for identifying and correcting AI-driven errors.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e75/11899695/8f7b4dfe7353/diagnostics-15-00635-g001.jpg
