
Interobserver agreement between artificial intelligence models in the thyroid imaging and reporting data system (TIRADS) assessment of thyroid nodules.

Author information

Leoncini Andrea, Trimboli Pierpaolo

Affiliations

Clinic for Radiology, Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale (EOC), Bellinzona, Switzerland.

Thyroid Unit, Clinic for Endocrinology and Diabetology, Ente Ospedaliero Cantonale (EOC), Bellinzona, Switzerland.

Publication information

Endocrine. 2025 May 15. doi: 10.1007/s12020-025-04272-1.

Abstract

BACKGROUND

As ultrasound (US) is the most accurate tool for assessing the thyroid nodule (TN) risk of malignancy (RoM), international societies have published various Thyroid Imaging and Reporting Data Systems (TIRADSs). With the recent advent of artificial intelligence (AI), clinicians and researchers should ask themselves how AI interprets TIRADS terminology and whether different AIs agree in the risk assessment of TNs. The study aim was to analyze the interobserver agreement (IOA) between AIs in assessing the RoM of TNs across TIRADS categories, using a case series created by combining TIRADS descriptors.

METHODS

ChatGPT, Google Gemini, and Claude were compared. ACR-TIRADS, EU-TIRADS, and K-TIRADS were employed to evaluate the AI assessments. Multiple written scenarios were created for the three TIRADSs, the cases were evaluated by the three AIs, and their assessments were analyzed and compared. The IOA was estimated using kappa (κ) values, which were then compared.
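The abstract does not state how the pairwise κ values were obtained (for example, whether weighted or unweighted kappa was used). As an illustration only, the sketch below shows one way to compute pairwise Cohen's κ between three raters using scikit-learn's cohen_kappa_score; the rating vectors are hypothetical and are not data from the study.

```python
# Minimal sketch: pairwise Cohen's kappa between AI raters.
# Assumes unweighted kappa (weighting is not specified in the abstract)
# and uses hypothetical TIRADS category assignments, not study data.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: TIRADS categories assigned by each AI to the same written scenarios.
ratings = {
    "ChatGPT": [3, 4, 5, 4, 2, 5, 3, 4],
    "Gemini":  [3, 5, 5, 4, 2, 4, 3, 4],
    "Claude":  [3, 5, 5, 3, 2, 4, 3, 5],
}

# Compute kappa for every pair of raters, mirroring the pairwise IOA reported in the study.
for (name_a, cats_a), (name_b, cats_b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(cats_a, cats_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```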

RESULTS

Ninety scenarios were created. With ACR-TIRADS, the IOA analysis gave κ = 0.58 between ChatGPT and Gemini, 0.53 between ChatGPT and Claude, and 0.90 between Gemini and Claude. With EU-TIRADS, κ was 0.73 between ChatGPT and Gemini, 0.62 between ChatGPT and Claude, and 0.72 between Gemini and Claude. With K-TIRADS, κ was 0.88 between ChatGPT and Gemini, 0.70 between ChatGPT and Claude, and 0.61 between Gemini and Claude.

CONCLUSION

This study found non-negligible variability between the three AIs. Clinicians and patients should be aware of these new findings.

