Wrightson James G, Blazey Paul, Moher David, Khan Karim M, Ardern Clare L
Department of Physical Therapy, The University of British Columbia Faculty of Medicine, Vancouver, British Columbia, Canada.
Centre for Aging SMART, The University of British Columbia, Vancouver, British Columbia, Canada.
BMJ Open. 2025 Mar 18;15(3):e088735. doi: 10.1136/bmjopen-2024-088735.
Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results. This exploratory study aimed to determine how accurately a large language model generative artificial intelligence system (AI-LLM) could determine reporting guideline compliance in a sample of sports medicine clinical trial reports.
This study was an exploratory retrospective data analysis. The OpenAI GPT-4 and Meta Llama 2 AI-LLMs were evaluated for their ability to determine reporting guideline adherence in a sample of sports medicine and exercise science clinical trial reports.
Academic research institution.
The study sample included 113 published sports medicine and exercise science clinical trial papers. For each paper, the GPT-4 Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4 Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into TRAIN and TEST datasets. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama 2 model was fine-tuned using the data from the GPT-4 Turbo analysis of the TRAIN dataset.
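As a rough illustration of this prompting approach, the sketch below sends an article's text to the GPT-4 Turbo API with one reporting-guideline question at a time. The prompt wording, question list and model string are assumptions for illustration only, not the study's actual materials.

```python
# Hypothetical sketch of per-question prompting of GPT-4 Turbo about a
# clinical trial report's text (illustrative prompts, not the study's own).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    "Does the article report how the sample size was determined?",
    "Does the article report how the random allocation sequence was generated?",
    # ...the study asked nine such questions about the article text
]

def check_reporting_items(article_text: str) -> dict[str, str]:
    """Ask the model each yes/no reporting-guideline question about the text."""
    answers = {}
    for question in QUESTIONS:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You assess clinical trial reports against reporting "
                            "guidelines. Answer only 'yes' or 'no'."},
                {"role": "user",
                 "content": f"{question}\n\nArticle text:\n{article_text}"},
            ],
        )
        answers[question] = response.choices[0].message.content.strip().lower()
    return answers
```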
The primary outcome was the F1-score, a measure of model performance on the TEST dataset. The secondary outcome was the model's classification accuracy (%).
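For context, both outcomes can be computed by comparing the model's answers with human reference labels on the TEST dataset; the minimal sketch below uses scikit-learn with fabricated labels purely to illustrate the metrics, not study data.

```python
# Illustrative only: F1-score and classification accuracy of model answers
# against human reference labels (fabricated example values).
from sklearn.metrics import f1_score, accuracy_score

reference = [1, 0, 1, 1, 0, 1]   # human-rated adherence (1 = item reported)
predicted = [1, 0, 1, 0, 0, 1]   # AI-LLM answers for the same items

print(f"F1-score: {f1_score(reference, predicted):.2f}")
print(f"Accuracy: {accuracy_score(reference, predicted):.0%}")
```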
Across all questions about the article text, the GPT-4 Turbo AI-LLM demonstrated acceptable performance (F1-score=0.89, accuracy (95% CI) = 90% (85% to 94%)). Accuracy for all reporting guidelines was >80%. The Llama 2 model accuracy was initially poor (F1-score=0.63, accuracy (95% CI) = 64% (57% to 71%)) and improved with fine-tuning (F1-score=0.84, accuracy (95% CI) = 83% (77% to 88%)). The GPT-4 Vision model accurately identified all participant flow diagrams (accuracy (95% CI) = 100% (89% to 100%)) but was less accurate at identifying when details were missing from the flow diagram (accuracy (95% CI) = 57% (39% to 73%)).
Both the GPT-4 and fine-tuned Llama 2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.