Wakonig Katharina Margherita, Barisch Simon, Kozarzewski Leonard, Dommerich Steffen, Lerchbaumer Markus Herbert
Department of Otorhinolaryngology, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Campus Virchow-Klinikum and Campus Charité Mitte, Charitéplatz 1, 10117 Berlin, Germany.
Department of Endocrinology, Diabetes and Metabolism, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charitéplatz 1, 10117 Berlin, Germany.
Diagnostics (Basel). 2025 Mar 6;15(5):635. doi: 10.3390/diagnostics15050635.
This study evaluates ChatGPT 4.0's ability to interpret thyroid ultrasound (US) reports using ACR TI-RADS 2017 criteria, comparing its performance with that of US users at different levels of experience. A team of medical experts, an inexperienced US user (a medical student), and ChatGPT 4.0 analyzed 100 fictitious thyroid US reports. ChatGPT's performance was assessed for accuracy, consistency, and diagnostic recommendations, including fine-needle aspiration (FNA) and follow-up. ChatGPT demonstrated substantial agreement with the experts in assessing echogenic foci, but inconsistencies in other criteria, such as composition and margins, were evident in both of its analysis runs. Interrater reliability between ChatGPT and the experts ranged from moderate to almost perfect, reflecting AI's potential as well as its limitations in achieving expert-level interpretation. The inexperienced US user outperformed ChatGPT, reaching almost perfect agreement with the experts and highlighting the critical role of traditional medical training in applying standardized risk stratification tools such as TI-RADS. ChatGPT showed high specificity in recommending FNA but lower sensitivity and specificity for follow-up recommendations compared with the medical student. These findings position ChatGPT as a supportive diagnostic tool rather than a replacement for human expertise. Refining AI algorithms and training could improve ChatGPT's clinical utility, enabling better support for clinicians in managing thyroid nodules and improving patient care. This study highlights both the promise and the current limitations of AI in medical diagnostics, advocating for its refinement and integration into clinical workflows. At the same time, it stresses that traditional clinical training must not be compromised, as it remains essential for identifying and correcting AI-driven errors.
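For readers unfamiliar with the risk stratification scheme the study tests, the following is a minimal sketch of the published ACR TI-RADS 2017 point system (Tessler et al., J Am Coll Radiol 2017). The feature labels, function names, and example nodule are this sketch's own illustrations; they are not drawn from the study's prompts, reports, or data.

```python
# Sketch of the ACR TI-RADS 2017 point system. Points are assigned per
# ultrasound feature; echogenic foci points are additive across foci types.
COMPOSITION = {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2}
ECHOGENICITY = {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                "hypoechoic": 2, "very_hypoechoic": 3}
SHAPE = {"wider_than_tall": 0, "taller_than_wide": 3}
MARGIN = {"smooth": 0, "ill_defined": 0, "lobulated": 2, "irregular": 2,
          "extrathyroidal_extension": 3}
FOCI = {"none": 0, "comet_tail": 0, "macrocalcification": 1,
        "peripheral_rim": 2, "punctate": 3}


def tirads_level(points: int) -> str:
    """Map a total point score to the TR1-TR5 category. The white paper
    lists TR2 as 2 points; a 1-point total is treated as TR2 here."""
    if points == 0:
        return "TR1"
    if points <= 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"


# Maximum-diameter thresholds (cm) from the 2017 white paper:
# FNA vs. follow-up US; TR1/TR2 nodules need neither.
FNA_CM = {"TR3": 2.5, "TR4": 1.5, "TR5": 1.0}
FOLLOWUP_CM = {"TR3": 1.5, "TR4": 1.0, "TR5": 0.5}


def recommend(composition, echogenicity, shape, margin, foci, size_cm):
    points = (COMPOSITION[composition] + ECHOGENICITY[echogenicity]
              + SHAPE[shape] + MARGIN[margin]
              + sum(FOCI[f] for f in foci))
    level = tirads_level(points)
    if size_cm >= FNA_CM.get(level, float("inf")):
        action = "FNA"
    elif size_cm >= FOLLOWUP_CM.get(level, float("inf")):
        action = "follow-up"
    else:
        action = "no further action"
    return level, points, action


# Example: a 1.2 cm solid, hypoechoic, wider-than-tall nodule with smooth
# margins and punctate echogenic foci scores 2+2+0+0+3 = 7 points -> TR5,
# which exceeds the 1.0 cm FNA threshold.
print(recommend("solid", "hypoechoic", "wider_than_tall", "smooth",
                ["punctate"], 1.2))  # ('TR5', 7, 'FNA')
```

Because the final recommendation depends on every feature score, a single misread criterion (e.g., margin) can shift the TR level and flip the FNA/follow-up decision, which is why the per-criterion inconsistencies reported above matter clinically.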
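The "moderate" to "almost perfect" agreement wording in the abstract follows the bands of Landis and Koch (1977), which are conventionally applied to Cohen's kappa; the abstract does not name the statistic, so its use here is an assumption. A minimal sketch with invented ratings:

```python
# Unweighted Cohen's kappa between two raters, plus the Landis-Koch
# verbal bands. The rating data below are illustrative, not study results.
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label lists."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    bands = [(0.0, "poor"), (0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial"),
             (1.00, "almost perfect")]
    return next(label for cutoff, label in bands if kappa <= cutoff)


# Hypothetical TR levels assigned to the same five reports by the expert
# panel and by ChatGPT.
experts = ["TR2", "TR3", "TR4", "TR5", "TR4"]
chatgpt = ["TR2", "TR3", "TR5", "TR5", "TR4"]
k = cohens_kappa(experts, chatgpt)
print(round(k, 2), landis_koch(k))  # 0.74 substantial
```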