ChatGPT-4's Accuracy in Estimating Thyroid Nodule Features and Cancer Risk From Ultrasound Images.

Author information

Cabezas Esteban, Toro-Tobon David, Johnson Thomas, Álvarez Marco, Azadi Javad R, Gonzalez-Velasquez Camilo, Singh Ospina Naykky, Ponce Oscar J, Branda Megan E, Brito Juan P

Affiliations

Knowledge and Evaluation Research Unit, Division of Endocrinology, Diabetes, Metabolism, and Nutrition, Department of Medicine, Mayo Clinic, Rochester, Minnesota.

Knowledge and Evaluation Research Unit, Division of Endocrinology, Diabetes, Metabolism, and Nutrition, Department of Medicine, Mayo Clinic, Rochester, Minnesota; Division of Endocrinology, Diabetes, Metabolism, and Nutrition, Mayo Clinic, Rochester, Minnesota.

Publication information

Endocr Pract. 2025 Mar 24. doi: 10.1016/j.eprac.2025.03.008.

Abstract

OBJECTIVE

To evaluate the performance of GPT-4 and GPT-4o in accurately identifying features and categories from thyroid nodule ultrasound images following the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS).

METHODS

This comparative validation study, conducted between October 2023 and May 2024, utilized 202 thyroid ultrasound images sourced from 3 open-access databases. Both complete and cropped versions of each image were independently evaluated by expert radiologists to establish a reference standard for TI-RADS features and categories. GPT-4 and GPT-4o were prompted to analyze each image, and their generated TI-RADS outputs were compared to the reference standard.

RESULTS

GPT-4 demonstrated high specificity but low sensitivity when assessing complete thyroid ultrasound images across most TI-RADS categories, resulting in mixed overall accuracy. For low-risk (benign) nodules, GPT-4 achieved 25.0% sensitivity, 99.5% specificity, and 93.6% accuracy. In contrast, in the higher-risk, moderately suspicious category, GPT-4 showed 75% sensitivity, 22.2% specificity, and 42.1% accuracy. While GPT-4 effectively identified features like smooth margins (73% vs 65% in the reference standard), it struggled to identify other features such as isoechoic echogenicity (5% vs 46%) and echogenic foci (3% vs 27%). The assessment of cropped images using both GPT-4 and GPT-4o followed similar patterns, though with slight deviations indicating a decrease in performance compared to GPT-4's assessment of complete images.
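The per-category figures above are one-vs-rest metrics: one TI-RADS category is treated as "positive" and all others as "negative". A minimal Python sketch of that calculation follows. The function name `one_vs_rest_metrics`, the label strings, and the counts are all illustrative assumptions; the counts were chosen only so that the arithmetic lands close to the percentages reported for the benign category and are not taken from the study.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(
    reference: Sequence[str], predicted: Sequence[str], category: str
) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy with `category` as the positive class."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref == category and pred == category:
            tp += 1  # reference positive, model agrees
        elif ref == category:
            fn += 1  # reference positive, model missed it
        elif pred == category:
            fp += 1  # reference negative, model called it positive
        else:
            tn += 1  # reference negative, model agrees
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


# Hypothetical labels for 202 images (NOT the study's data):
# 16 reference-benign nodules, of which the model labels 4 as benign;
# 186 other nodules, of which the model labels 1 as benign.
reference = ["benign"] * 16 + ["other"] * 186
predicted = ["benign"] * 4 + ["other"] * 12 + ["benign"] * 1 + ["other"] * 185

metrics = one_vs_rest_metrics(reference, predicted, "benign")
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these assumed counts, accuracy stays high even though sensitivity is low because the category is rare: accuracy is a prevalence-weighted average of sensitivity and specificity, so when few images are truly positive it is dominated by specificity.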

CONCLUSION

While GPT-4 and GPT-4o models show potential for improving the efficiency of thyroid nodule triage, their performance remains suboptimal, particularly in higher-risk categories. Further refinement and validation of these models are necessary before clinical implementation.
