Large Language Model-Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Usability Study.

Authors

Huang Jiajie, Lai Honghao, Zhao Weilong, Xia Danni, Bai Chunyang, Sun Mingyao, Liu Jianing, Liu Jiayi, Pan Bei, Tian Jinhui, Ge Long

Affiliations

Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.

Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.

Publication

J Med Internet Res. 2025 Jun 24;27:e70450. doi: 10.2196/70450.

DOI:10.2196/70450
PMID:40554779
Abstract

BACKGROUND

The revised Risk-of-Bias tool (RoB2) overcomes the limitations of its predecessor but introduces new implementation challenges. Studies demonstrate low interrater reliability and substantial time requirements for RoB2 implementation. Large language models (LLMs) may assist in RoB2 implementation, although their effectiveness remains uncertain.

OBJECTIVE

This study aims to evaluate the accuracy of LLMs in RoB2 assessments to explore their potential as research assistants for bias evaluation.

METHODS

We systematically searched the Cochrane Library (through October 2023) for reviews using RoB2, categorized by whether the effect of interest was adhering to the intervention or assignment to the intervention. From 86 eligible reviews of randomized controlled trials (covering 1399 RCTs), we randomly selected 46 RCTs (23 per category). In addition, 3 experienced reviewers independently assessed all 46 RCTs using RoB2, recording assessment time for each trial. Reviewer judgments were reconciled through consensus. Furthermore, 6 RCTs (3 from each category) were randomly selected for prompt development and optimization. The remaining 40 trials established the internal validation standard, while Cochrane Reviews judgments served as external validation. Primary outcomes were extracted as reported in corresponding Cochrane Reviews. We calculated accuracy rates, Cohen κ, and time differentials.
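As an illustration of the two agreement metrics named above, the following minimal sketch computes an accuracy rate and Cohen κ for LLM judgments against a reference standard. It is not the authors' analysis code; the function names and the toy labels are assumptions made for this example, and the labels are not data from the study.

```python
from collections import Counter


def accuracy(pred, ref):
    """Share of items where the predicted label equals the reference label."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("pred and ref must be non-empty and of equal length")
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def cohen_kappa(pred, ref):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on nominal labels."""
    n = len(ref)
    p_o = accuracy(pred, ref)  # observed agreement
    pred_counts, ref_counts = Counter(pred), Counter(ref)
    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    p_e = sum(pred_counts[c] * ref_counts[c] for c in set(pred) | set(ref)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one identical label only
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy judgments (illustrative only, not taken from the study).
llm = ["Low", "High", "Some concerns", "High", "Low", "High"]
reviewers = ["Low", "High", "High", "High", "Some concerns", "High"]

print(f"accuracy = {accuracy(llm, reviewers):.3f}")
print(f"kappa    = {cohen_kappa(llm, reviewers):.3f}")
```

Accuracy counts raw agreement, whereas κ discounts the agreement expected by chance, which matters when one category (for example "High") dominates the reference judgments.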

RESULTS

We identified significant differences between Cochrane and reviewer judgments, particularly in domains 1, 4, and 5, likely due to different standards in assessing randomization and blinding. Among the 20 articles focusing on adhering, 18 Cochrane Reviews and 19 reviewer judgments classified them as "High risk," while assignment-focused RCTs showed a more heterogeneous risk distribution. Compared with Cochrane Reviews, LLMs demonstrated accuracy rates of 57.5% and 70% for overall (assignment) and overall (adhering), respectively. When compared with reviewer judgments, LLMs' accuracy rates were 65% and 70% for the same two overall judgments. The average accuracy rates for the remaining 6 domains were 65.2% (95% CI 57.6-72.7) against Cochrane Reviews and 74.2% (95% CI 64.7-83.9) against reviewers. At the signaling question level, LLMs achieved 83.2% average accuracy (95% CI 77.5-88.9), with accuracy exceeding 70% for most questions except 2.4 (assignment), 2.5 (assignment), 3.3, and 3.4. When domain judgments were derived from LLM-generated signaling question answers using the RoB2 algorithm rather than from direct LLM domain judgments, accuracy improved substantially for Domain 2 (adhering; from 55% to 95%) and overall (adhering; from 70% to 90%). LLMs demonstrated high consistency between iterations (average 85.2%, 95% CI 85.15-88.79) and completed assessments in 1.9 minutes versus 31.5 minutes for human reviewers (mean difference 29.6, 95% CI 25.6-33.6 minutes).
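To make the idea of deriving judgments by rule rather than asking the model for them directly more concrete, here is a minimal sketch of the published RoB2 rule for the overall judgment: low only if every domain is low, high if any domain is high, otherwise some concerns. The abstract does not describe the study's own pipeline, so the function name, the domain keys, and the `concerns_substantially_lower_confidence` flag are assumptions for illustration. The flag stands in for the reviewer judgment that RoB2 requires before several "Some concerns" ratings are escalated to "High".

```python
LOW, SOME, HIGH = "Low", "Some concerns", "High"


def overall_judgment(domain_judgments, concerns_substantially_lower_confidence=False):
    """Combine per-domain RoB2 judgments into an overall judgment.

    domain_judgments: mapping of domain name -> "Low" | "Some concerns" | "High".
    concerns_substantially_lower_confidence: hypothetical flag for the judgment
    call that multiple "Some concerns" ratings substantially lower confidence.
    """
    values = list(domain_judgments.values())
    if not values:
        raise ValueError("at least one domain judgment is required")
    unknown = set(values) - {LOW, SOME, HIGH}
    if unknown:
        raise ValueError(f"unknown judgment labels: {sorted(unknown)}")

    if HIGH in values:
        return HIGH  # any high-risk domain makes the overall judgment high
    n_some = values.count(SOME)
    if n_some == 0:
        return LOW  # every domain is low risk
    if n_some > 1 and concerns_substantially_lower_confidence:
        return HIGH  # escalation that depends on reviewer judgment
    return SOME


# Hypothetical domain judgments for one trial (illustrative only).
example = {"D1": LOW, "D2": SOME, "D3": LOW, "D4": LOW, "D5": LOW}
print(overall_judgment(example))
```

Applying a fixed rule to the lower-level outputs keeps the final rating consistent with the tool's logic, which is one plausible reading of why the rule-derived judgments reported above were more accurate than the direct ones.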

CONCLUSIONS

LLMs achieved commendable accuracy when guided by structured prompts, particularly when processing methodological details through structured reasoning. While not replacing human assessment, LLMs demonstrate strong potential for assisting RoB2 evaluations. Larger studies with improved prompting could enhance performance.

Cited by

1
Correction: Large Language Model-Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Evaluation Study.
J Med Internet Res. 2025 Jul 14;27:e80519. doi: 10.2196/80519.
