
ChatGPT-4o in Risk-of-Bias Assessments in Neonatology: A Validity Analysis.

Authors

Kuitunen Ilari, Nyrhi Lauri, De Luca Daniele

Affiliations

Kuopio Pediatric Research Unit, University of Eastern Finland, Kuopio, Finland.

Department of Pediatrics and Neonatology, Kuopio University Hospital, Kuopio, Finland.

Publication information

Neonatology. 2025;122(3):360-365. doi: 10.1159/000544857. Epub 2025 Feb 25.

DOI:10.1159/000544857
PMID:39999815
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12129414/
Abstract

INTRODUCTION

Only a few studies have addressed the potential of large language models (LLMs) in risk-of-bias assessments, and their results have varied. The aim of this study was to analyze how well ChatGPT performs in risk-of-bias assessments of neonatal studies.

METHODS

We identified all Cochrane neonatal intervention reviews published in 2024 and extracted all risk-of-bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o together with guidance for performing an assessment with the original Cochrane risk-of-bias tool. Concordance between the original assessment and the one provided by ChatGPT-4o was evaluated with intraclass correlation coefficients and Cohen's kappa statistics (with 95% confidence intervals) for each risk-of-bias domain and for the overall assessment.
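As a rough illustration of the agreement statistic named in the methods, the sketch below computes Cohen's kappa for two raters with a simple large-sample 95% confidence interval. It is not the authors' analysis code: the function name `cohens_kappa`, the toy labels, and the choice of standard-error approximation are all assumptions made for this example.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return (kappa, (ci_low, ci_high)) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    kappa = (p_o - p_e) / (1 - p_e)
    # Simple large-sample standard error (an assumption of this sketch;
    # statistical packages may use a more exact variance formula).
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)


# Hypothetical toy judgments, NOT data from the study.
reviewers = ["low", "low", "high", "unclear", "low",
             "high", "unclear", "low", "high", "low"]
model = ["low", "low", "high", "low", "low",
         "unclear", "unclear", "low", "high", "high"]

kappa, (ci_low, ci_high) = cohens_kappa(reviewers, model)
print(f"kappa = {kappa:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```

In the toy data the raters agree on 7 of 10 items (observed agreement 0.70) while chance agreement is 0.38, so kappa is (0.70 - 0.38) / (1 - 0.38), about 0.52. Kappa corrects raw agreement for the agreement expected by chance, which is why it can be close to zero, or negative, even when raters often give the same label.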

RESULTS

From 9 reviews, a total of 61 randomized studies were analyzed, and 427 judgments were compared. The overall κ was 0.43 (95% CI: 0.35 to 0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59 to 0.70). Cohen's κ was also assessed for each domain: the best agreement was observed for allocation concealment (κ = 0.73, 95% CI: 0.55 to 0.90), whereas the poorest agreement was found for incomplete outcome data (κ = -0.03, 95% CI: -0.07 to 0.02).
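To put the reported numbers in context, the sketch below checks the judgment count and maps the reported κ values onto the widely used Landis and Koch descriptive bands. The abstract does not say which interpretation scale or sufficiency threshold the authors applied, so the bands and the helper name `landis_koch` are assumptions made here for illustration only.

```python
N_STUDIES = 61
N_JUDGMENTS = 427

# 427 judgments over 61 studies is exactly 7 per study; the abstract does
# not spell out how those seven judgments break down.
per_study, remainder = divmod(N_JUDGMENTS, N_STUDIES)
print(f"{per_study} judgments per study (remainder {remainder})")
# prints "7 judgments per study (remainder 0)"


def landis_koch(kappa):
    """Map a kappa value to its Landis and Koch descriptive label."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label


# Kappa values as reported in the abstract.
reported = {
    "overall": 0.43,
    "allocation concealment": 0.73,
    "incomplete outcome data": -0.03,
}
for name, value in reported.items():
    print(f"{name}: kappa = {value:+.2f} -> {landis_koch(value)}")
# overall -> moderate
# allocation concealment -> substantial
# incomplete outcome data -> poor
```

On this scale the overall agreement is only moderate, and agreement on incomplete outcome data is no better than chance.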

CONCLUSION

ChatGPT-4o failed to achieve sufficient agreement in the risk-of-bias assessments. Future studies should examine whether other LLMs perform better or whether agreement with ChatGPT-4o could be improved by better prompting. Currently, the use of ChatGPT-4o in risk-of-bias assessments should not be promoted.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ab5/12129414/b136b1bc0282/neo-2025-0122-0003-544857_F02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ab5/12129414/f467d1183910/neo-2025-0122-0003-544857_F01.jpg

Similar articles

1. ChatGPT-4o in Risk-of-Bias Assessments in Neonatology: A Validity Analysis.
Neonatology. 2025;122(3):360-365. doi: 10.1159/000544857. Epub 2025 Feb 25.
2. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.
Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.
3. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
4. ChatGPT as an effective tool for quality evaluation of radiomics research.
Eur Radiol. 2025 Apr;35(4):2030-2042. doi: 10.1007/s00330-024-11122-7. Epub 2024 Oct 15.
5. Chasing sleep physicians: ChatGPT-4o on the interpretation of polysomnographic results.
Eur Arch Otorhinolaryngol. 2025 Mar;282(3):1631-1639. doi: 10.1007/s00405-024-08985-3. Epub 2024 Oct 20.
6. Can the large language model ChatGPT-4omni predict outcomes in adult patients with status epilepticus?
Epilepsia. 2025 Mar;66(3):674-685. doi: 10.1111/epi.18215. Epub 2024 Dec 26.
7. Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery.
Eur Arch Otorhinolaryngol. 2025 Mar;282(3):1593-1607. doi: 10.1007/s00405-024-09153-3. Epub 2025 Jan 10.
8. ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports.
Clin Imaging. 2025 May;121:110455. doi: 10.1016/j.clinimag.2025.110455. Epub 2025 Mar 13.
9. ChatGPT-4o can serve as the second rater for data extraction in systematic reviews.
PLoS One. 2025 Jan 7;20(1):e0313401. doi: 10.1371/journal.pone.0313401. eCollection 2025.
10. The future of Cochrane Neonatal.
Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.

Cited by

1. Human Versus Artificial Intelligence: Comparing Cochrane Authors' and ChatGPT's Risk of Bias Assessments.
Cochrane Evid Synth Methods. 2025 Aug 31;3(5):e70044. doi: 10.1002/cesm.70044. eCollection 2025 Sep.
