Kuitunen Ilari, Nyrhi Lauri, De Luca Daniele
Kuopio Pediatric Research Unit, University of Eastern Finland, Kuopio, Finland.
Department of Pediatrics and Neonatology, Kuopio University Hospital, Kuopio, Finland.
Neonatology. 2025;122(3):360-365. doi: 10.1159/000544857. Epub 2025 Feb 25.
Only a few studies have addressed the potential of large language models (LLMs) in risk-of-bias assessments, and their results have varied. The aim of this study was to analyze how well ChatGPT performs in risk-of-bias assessments of neonatal studies.
We searched all Cochrane neonatal intervention reviews published in 2024 and extracted all risk-of-bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o together with the guidance for performing an original Cochrane risk-of-bias assessment. The concordance between the original assessment and that provided by ChatGPT-4o was evaluated with intraclass correlation coefficients and Cohen's kappa statistics (with 95% confidence intervals) for each risk-of-bias domain and for the overall assessment.
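As an illustration of the agreement statistic named above, the following is a minimal sketch of unweighted Cohen's kappa for two sets of categorical judgments. It is not the authors' analysis code. The function name `cohen_kappa`, the label strings, and the six example judgments are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of judgments")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of the marginal proportions.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for one risk-of-bias domain (illustration only).
reviewers = ["low", "low", "high", "unclear", "low", "high"]
model = ["low", "high", "high", "unclear", "low", "low"]
kappa = cohen_kappa(reviewers, model)
print(round(kappa, 3))
```

In this made-up example the raters agree on 4 of 6 judgments (observed agreement 0.667) against a chance agreement of 14/36 (0.389), which gives a kappa of 5/11, about 0.455.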
From 9 reviews, a total of 61 randomized studies were analyzed, and 427 judgments were compared. The overall κ was 0.43 (95% CI: 0.35 to 0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59 to 0.70). Cohen's κ was also calculated for each domain: the best agreement was observed for allocation concealment (κ = 0.73, 95% CI: 0.55 to 0.90), whereas the poorest agreement was found for incomplete outcome data (κ = -0.03, 95% CI: -0.07 to 0.02).
ChatGPT-4o failed to achieve sufficient agreement in the risk-of-bias assessments. Future studies should examine whether other LLMs perform better or whether the agreement achieved with ChatGPT-4o could be improved by better prompting. Currently, the use of ChatGPT-4o in risk-of-bias assessments should not be promoted.