
GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines.

Authors

Wrightson James G, Blazey Paul, Moher David, Khan Karim M, Ardern Clare L

Affiliations

Department of Physical Therapy, The University of British Columbia Faculty of Medicine, Vancouver, British Columbia, Canada.

Centre for Aging SMART, The University of British Columbia, Vancouver, British Columbia, Canada.

Publication

BMJ Open. 2025 Mar 18;15(3):e088735. doi: 10.1136/bmjopen-2024-088735.

DOI: 10.1136/bmjopen-2024-088735
PMID: 40107689
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11927406/
Abstract

OBJECTIVES

Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results. This exploratory study aimed to determine how accurate a large language model generative artificial intelligence system (AI-LLM) was for determining reporting guideline compliance in a sample of sports medicine clinical trial reports.

DESIGN

This study was an exploratory retrospective data analysis. OpenAI GPT-4 and Meta Llama 2 AI-LLM were evaluated for their ability to determine reporting guideline adherence in a sample of sports medicine and exercise science clinical trial reports.

SETTING

Academic research institution.

PARTICIPANTS

The study sample included 113 published sports medicine and exercise science clinical trial papers. For each paper, the GPT-4 Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4 Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into a TRAIN and TEST dataset. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama 2 model was fine-tuned using the data from the GPT-4 Turbo analysis of the TRAIN dataset.
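The 80/20 random split described above can be sketched in plain Python; the paper identifiers, seed, and helper name below are illustrative, not taken from the study:

```python
import random

def split_train_test(items, test_frac=0.2, seed=42):
    """Randomly split a list of papers into TRAIN and TEST subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

papers = [f"paper_{i}" for i in range(113)]  # 113 trial reports, as in the study sample
train, test = split_train_test(papers)
print(len(train), len(test))  # 90 23
```

With 113 papers an 80/20 split yields 90 TRAIN and 23 TEST papers; holding the TEST set out of all prompt tuning and fine-tuning is what makes the reported F1-scores an out-of-sample estimate.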

PRIMARY AND SECONDARY OUTCOME MEASURES

The primary outcome was the F1-score, a measure of model performance on the TEST dataset. The secondary outcome was the model's classification accuracy (%).

RESULTS

Across all questions about the article text, the GPT-4 Turbo AI-LLM demonstrated acceptable performance (F1-score=0.89, accuracy (95% CI) = 90% (85% to 94%)). Accuracy for all reporting guidelines was >80%. The Llama 2 model accuracy was initially poor (F1-score=0.63, accuracy (95% CI) = 64% (57% to 71%)) and improved with fine-tuning (F1-score=0.84, accuracy (95% CI) = 83% (77% to 88%)). The GPT-4 Vision model accurately identified all participant flow diagrams (accuracy (95% CI) = 100% (89% to 100%)) but was less accurate at identifying when details were missing from the flow diagram (accuracy (95% CI) = 57% (39% to 73%)).
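For reference, the F1-score, accuracy, and 95% confidence intervals of the kind reported above can be computed from raw per-question adherence judgments. This is a generic sketch (the Wilson score interval is one common choice for proportions near 0% or 100%), not the authors' analysis code:

```python
import math

def f1_and_accuracy(y_true, y_pred):
    """Binary F1-score and accuracy from parallel 0/1 label lists
    (1 = guideline item judged reported)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return f1, accuracy

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% interval for a proportion; unlike the normal
    approximation it stays inside [0, 1] even when accuracy is 100%."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

Note that an asymmetric interval such as 100% (89% to 100%) for the flow diagram question is exactly the shape a score-based interval produces at a proportion of 1.0.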

CONCLUSIONS

Both the GPT-4 and fine-tuned Llama 2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c319/11927406/b10bd39e3994/bmjopen-15-3-g001.jpg

Similar Articles

1. GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines. BMJ Open. 2025 Mar 18;15(3):e088735. doi: 10.1136/bmjopen-2024-088735.
2. Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports. Radiology. 2025 Jan;314(1):e240895. doi: 10.1148/radiol.240895.
3. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas. Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
4. Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models. Radiology. 2025 Feb;314(2):e241051. doi: 10.1148/radiol.241051.
5. GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3. Bioengineering (Basel). 2024 Oct 18;11(10):1043. doi: 10.3390/bioengineering11101043.
6. Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports. Radiology. 2024 Oct;313(1):e241139. doi: 10.1148/radiol.241139.
7. Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media: Evaluation Study. J Med Internet Res. 2025 Mar 4;27:e63631. doi: 10.2196/63631.
8. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Nov 5;26:e56532. doi: 10.2196/56532.
9. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study. JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
10. Classifying the Information Needs of Survivors of Domestic Violence in Online Health Communities Using Large Language Models: Prediction Model Development and Evaluation Study. J Med Internet Res. 2025 May 12;27:e65397. doi: 10.2196/65397.

Cited By

1. Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review. JAMA Netw Open. 2025 Aug 1;8(8):e2529418. doi: 10.1001/jamanetworkopen.2025.29418.
2. Large language models in clinical nutrition: an overview of its applications, capabilities, limitations, and potential future prospects. Front Nutr. 2025 Aug 7;12:1635682. doi: 10.3389/fnut.2025.1635682. eCollection 2025.
3. Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). J Med Syst. 2025 Jun 12;49(1):80. doi: 10.1007/s10916-025-02212-0.

References

1. Endorsements of five reporting guidelines for biomedical research by journals of prominent publishers. PLoS One. 2024 Feb 29;19(2):e0299806. doi: 10.1371/journal.pone.0299806. eCollection 2024.
2. Up Front and Open? Shrouded in Secrecy? Or Somewhere in Between? A Meta-Research Systematic Review of Open Science Practices in Sport Medicine Research. J Orthop Sports Phys Ther. 2023 Dec;53(12):735-747. doi: 10.2519/jospt.2023.12016.
3. Peer Review and Scientific Publication at a Crossroads: Call for Research for the 10th International Congress on Peer Review and Scientific Publication. JAMA. 2023 Oct 3;330(13):1232-1235. doi: 10.1001/jama.2023.17607.
4. Reminding Peer Reviewers of Reporting Guideline Items to Improve Completeness in Published Articles: Primary Results of 2 Randomized Trials. JAMA Netw Open. 2023 Jun 1;6(6):e2317651. doi: 10.1001/jamanetworkopen.2023.17651.
5. Artificial intelligence hallucinations. Crit Care. 2023 May 10;27(1):180. doi: 10.1186/s13054-023-04473-y.
6. Replication concerns in sports and exercise science: a narrative review of selected methodological issues in the field. R Soc Open Sci. 2022 Dec 14;9(12):220946. doi: 10.1098/rsos.220946. eCollection 2022 Dec.
7. Reporting and transparent research practices in sports medicine and orthopaedic clinical trials: a meta-research study. BMJ Open. 2022 Aug 8;12(8):e059347. doi: 10.1136/bmjopen-2021-059347.
8. The reporting standards of randomised controlled trials in leading medical journals between 2019 and 2020: a systematic review. Ir J Med Sci. 2023 Feb;192(1):73-80. doi: 10.1007/s11845-022-02955-6. Epub 2022 Mar 3.
9. Open and transparent sports science research: the role of journals to move the field forward. Knee Surg Sports Traumatol Arthrosc. 2022 Nov;30(11):3599-3601. doi: 10.1007/s00167-022-06893-9. Epub 2022 Jan 29.
10. Methods to assess research misconduct in health-related research: A scoping review. J Clin Epidemiol. 2021 Aug;136:189-202. doi: 10.1016/j.jclinepi.2021.05.012. Epub 2021 May 24.