Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review.

Author information

Srinivasan Apoorva, Berkowitz Jacob, Friedrich Nadine A, Kivelson Sophia, Tatonetti Nicholas P

Affiliations

Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California.

Cedars Sinai Cancer, Cedars Sinai Medical Center, Los Angeles, California.

Publication information

JAMA Netw Open. 2025 Aug 1;8(8):e2529418. doi: 10.1001/jamanetworkopen.2025.29418.

DOI:10.1001/jamanetworkopen.2025.29418

PMID:40875232

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12395317/

Abstract

IMPORTANCE

Incomplete reporting in randomized clinical trials (RCTs) obscures bias and limits reproducibility. Manual audits for adherence to the Consolidated Standards of Reporting Trials (CONSORT) guideline cannot keep pace with publication volume.

OBJECTIVES

To build and validate a zero-shot large-language-model (LLM) pipeline for automated CONSORT assessment and to map reporting quality over time, biomedical disciplines, and trial features.

DESIGN, SETTING, AND PARTICIPANTS

This systematic review included RCTs that were indexed on PubMed, available in English, open access, conducted in human participants, and published between MONTH 1966 and MONTH 2024. PubMed PDFs were converted to XML and linked with Semantic Scholar and ClinicalTrials.gov metadata. Chat GPT-4o-mini was tested on the 50-article CONSORT-Text Classification Model (CONSORT-TM) benchmark, checked by experts in 70 randomly sampled RCTs, and then applied to the full sample.

EXPOSURE

Publication year, biomedical discipline, funding source, trial phase, US Food and Drug Administration regulation, and oversight features.

MAIN OUTCOMES AND MEASURES

The LLM judged whether each of 21 CONSORT items was met. Primary outcomes were (1) model performance vs expert review (precision, recall, and macro F1 score) and (2) proportion of items reported.
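As an illustration of the first primary outcome, the following is a minimal sketch of how precision, recall, and a macro F1 score could be computed from paired expert and model decisions. It assumes macro averaging means an unweighted mean of per-item F1 scores; the abstract does not state the exact averaging scheme, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def precision_recall_f1(pairs):
    """Return (precision, recall, F1) for one CONSORT item.

    pairs: list of (expert, model) booleans, True meaning "item reported".
    """
    tp = sum(1 for expert, model in pairs if expert and model)
    fp = sum(1 for expert, model in pairs if not expert and model)
    fn = sum(1 for expert, model in pairs if expert and not model)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(decisions):
    """Unweighted mean of per-item F1 (assumed averaging scheme).

    decisions: iterable of (item_id, expert, model) tuples.
    """
    by_item = defaultdict(list)
    for item_id, expert, model in decisions:
        by_item[item_id].append((expert, model))
    scores = [precision_recall_f1(pairs)[2] for pairs in by_item.values()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two items, a handful of articles each.
toy = [
    ("randomization", True, True),
    ("randomization", True, False),
    ("randomization", False, True),
    ("randomization", False, False),
    ("blinding", True, True),
    ("blinding", False, False),
]
print(macro_f1(toy))
```

Macro averaging weights every item equally, so a rarely reported item counts as much as a commonly reported one.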

RESULTS

Of 53 137 screened PDFs, 21 041 RCTs (median [IQR] publication year, 2014 [2003-2020]; 30 disciplines) were included, with a registry-linked subset of 1790 RCTs that had a median (IQR) planned enrollment of 210 (95-440) participants. In the 70-article validation set (2210 decisions), LLM outputs matched experts 91.7% of the time (2026 of 2210 decisions); the macro F1 score on CONSORT-TM was 0.86 (95% CI, 0.84-0.87). Mean CONSORT compliance increased from 27.3% (95% CI, 27.0%-27.6%) in 1966 to 1990 to 57.0% (95% CI, 56.8%-57.2%) in 2010 to 2024. However, reporting of critical elements remained uncommon, such as allocation-concealment mechanism (16.1% [95% CI, 15.6%-16.6%]) and external-validity discussion (1.6% [95% CI, 1.5%-1.8%]). Compliance varied across disciplines from 35.2% (95% CI, 34.8%-35.6%) in pharmacology to 63.4% (95% CI, 62.1%-64.7%) in urology and showed only negligible associations with clinical trial characteristics (all Cramér V <0.10).
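Two of the quantities above can be illustrated with a short sketch: the expert agreement rate, reproduced from the reported counts, and Cramér V, computed with the standard formula V = sqrt(chi-square / (n * min(r - 1, c - 1))). The contingency tables here are hypothetical and are not data from the study; the function assumes every row and column total is nonzero.

```python
import math


def cramers_v(table):
    """Cramér V for an r x c contingency table given as a list of rows.

    Assumes every row and column total is nonzero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Agreement rate from the counts reported in the abstract.
agreement = 2026 / 2210
print(f"{agreement:.1%}")

# Hypothetical tables (rows: trial feature present/absent;
# columns: CONSORT item reported/not reported).
independent = [[5, 5], [5, 5]]
perfectly_associated = [[10, 0], [0, 10]]
print(cramers_v(independent), cramers_v(perfectly_associated))
```

Cramér V ranges from 0 (no association) to 1 (perfect association), so values below 0.10 fall at the low end of the scale, consistent with the "negligible" description in the text.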

CONCLUSIONS AND RELEVANCE

In this systematic review of RCTs, a zero-shot LLM audited CONSORT adherence at scale, uncovering persistent reporting gaps and wide variation across biomedical disciplines, underscoring the need for targeted editorial action to boost transparency and reproducibility.

