Mese Ismail, Kocak Burak
Department of Radiology, Erenkoy Mental Health and Neurology Training and Research Hospital, University of Health Sciences, Istanbul, Turkey.
Department of Radiology, Basaksehir Cam and Sakura City Hospital, University of Health Sciences, Istanbul, Turkey.
Eur Radiol. 2025 Apr;35(4):2030-2042. doi: 10.1007/s00330-024-11122-7. Epub 2024 Oct 15.
This study aimed to evaluate the effectiveness of ChatGPT-4o in assessing the methodological quality of radiomics research using the radiomics quality score (RQS) compared to human experts.
Open-access, peer-reviewed radiomics research articles published under a Creative Commons Attribution (CC-BY) license in European Radiology, European Radiology Experimental, and Insights into Imaging between 2023 and 2024 were included in this study. Pre-prints from MedRxiv were also included to evaluate potential peer-review bias. Using the RQS, each study was assessed independently twice by ChatGPT-4o and once by two radiologists in consensus.
In total, 52 open-access and peer-reviewed articles were included in this study. Both the ChatGPT-4o evaluation (average of two readings) and the human experts had a median RQS of 14.5, corresponding to a percentage score of 40.3% (p > 0.05). Pairwise comparisons revealed no statistically significant difference between the readings of ChatGPT-4o and the human experts (corrected p > 0.05). The intraclass correlation coefficient (ICC) for intra-rater reliability of ChatGPT-4o was 0.905 (95% CI: 0.840-0.944), and the ICCs for inter-rater reliability between each ChatGPT-4o evaluation and the human experts were 0.859 (95% CI: 0.756-0.919) and 0.914 (95% CI: 0.855-0.949), corresponding to good to excellent reliability throughout. The evaluation by ChatGPT-4o required less time (2.9-3.5 min per article) than that by the human experts (13.9 min per article for a single reader). Item-wise reliability analysis showed that ChatGPT-4o maintained consistently high reliability across almost all RQS items.
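The percentage score reported above follows from normalizing the raw RQS by its maximum; a minimal sketch, assuming the standard RQS maximum of 36 points (the abstract does not state the maximum explicitly):

```python
RQS_MAX = 36  # assumed maximum achievable radiomics quality score

def rqs_percentage(score: float, max_score: int = RQS_MAX) -> float:
    """Convert a raw RQS value to a percentage of the maximum, 1 decimal."""
    return round(100 * score / max_score, 1)

# Median RQS of 14.5 reported in the results
print(rqs_percentage(14.5))  # -> 40.3
```

This reproduces the 40.3% figure paired with the median RQS of 14.5 in the results.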
ChatGPT-4o provides reliable and efficient assessments of radiomics research quality. Its evaluations closely align with those of human experts and reduce evaluation time.
Question: Is ChatGPT effective and reliable in evaluating radiomics research quality based on the RQS?
Findings: ChatGPT-4o showed high reliability and efficiency, with evaluations closely matching those of human experts. It can considerably reduce the time required for radiomics research quality assessment.
Clinical relevance: ChatGPT-4o offers a quick and reliable automated alternative for evaluating the quality of radiomics research, with the potential to assess radiomics research at a large scale in the future.