ChatGPT-4o in Risk-of-Bias Assessments in Neonatology: A Validity Analysis

Authors

Kuitunen Ilari, Nyrhi Lauri, De Luca Daniele

Affiliations

Kuopio Pediatric Research Unit, University of Eastern Finland, Kuopio, Finland.

Department of Pediatrics and Neonatology, Kuopio University Hospital, Kuopio, Finland.

Publication information

Neonatology. 2025;122(3):360-365. doi: 10.1159/000544857. Epub 2025 Feb 25.

DOI:10.1159/000544857

PMID:39999815

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12129414/

Abstract

INTRODUCTION

Only a few studies have addressed the potential of large language models (LLMs) in risk-of-bias assessments, and their results have varied. The aim of this study was to analyze how well ChatGPT performs in risk-of-bias assessments of neonatal studies.

METHODS

We searched all Cochrane neonatal intervention reviews published in 2024 and extracted all risk-of-bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o together with the guidance for performing an original Cochrane risk-of-bias analysis. The concordance between the original assessment and that provided by ChatGPT-4o was evaluated with intraclass correlation coefficients and Cohen's kappa statistics (with 95% confidence intervals) for each risk-of-bias domain and for the overall assessment.
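As an illustration of the agreement statistic named above, the following is a minimal sketch of unweighted Cohen's kappa for two sets of categorical judgments. It is not the authors' analysis code: the function name `cohens_kappa`, the category labels, and the six example judgments are hypothetical, and the abstract does not say whether a weighted variant was used or how the confidence intervals were obtained, so neither is reproduced here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of judgments")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for one risk-of-bias domain (not data from the study).
review_judgments = ["low", "low", "high", "unclear", "low", "high"]
model_judgments = ["low", "unclear", "high", "unclear", "low", "low"]

kappa = cohens_kappa(review_judgments, model_judgments)
print(f"kappa = {kappa:.3f}")
```

In this toy example the two raters agree on 4 of 6 items (observed agreement 24/36), chance agreement is 13/36, and kappa is therefore 11/23. A value near 0, or below it, means agreement no better than chance.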

RESULTS

From 9 reviews, a total of 61 randomized studies were analyzed, and 427 judgments were compared. The overall κ was 0.43 (95% CI: 0.35 to 0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59 to 0.70). Cohen's κ was also assessed for each domain: the best agreement was observed for allocation concealment (κ = 0.73, 95% CI: 0.55 to 0.90), whereas the poorest agreement was found for incomplete outcome data (κ = -0.03, 95% CI: -0.07 to 0.02).

CONCLUSION

ChatGPT-4o failed to achieve sufficient agreement in the risk-of-bias assessments. Future studies should examine whether other LLMs perform better or whether the agreement achieved by ChatGPT-4o could be improved by better prompting. Currently, the use of ChatGPT-4o in risk-of-bias assessments should not be promoted.
