Sercombe Jayden, Bryant Zachary, Wilson Jack
The Matilda Centre for Research in Mental Health and Substance Use, University of Sydney, Jane Foss Russell Building (G02), Level 6, Sydney, 2006, Australia, +61 2 8627 9380.
JMIR Form Res. 2025 Aug 11;9:e68666. doi: 10.2196/68666.
Systematic reviews are essential for synthesizing research in health sciences; however, they are resource-intensive and prone to human error. The data extraction phase, in which key details of studies are identified and recorded in a systematic manner, may benefit from the application of automation processes. Recent advancements in artificial intelligence, specifically in large language models (LLMs) such as ChatGPT, may streamline this process.
This study aimed to develop and evaluate a custom Generative Pre-trained Transformer (GPT), named Systematic Review Extractor Pro, for automating the data extraction phase of systematic reviews in health research.
OpenAI's GPT Builder was used to create a GPT tailored to extract information from academic manuscripts. The Role, Instruction, Steps, End goal, and Narrowing (RISEN) framework was used to inform prompt engineering for the GPT. A sample of 20 studies from two distinct systematic reviews was used to evaluate the GPT's performance in extraction. Agreement rates between the GPT outputs and human reviewers were calculated for each study subsection.
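The per-subsection agreement rates described above can be sketched as a simple field-by-field comparison between the GPT output and the human reviewer's extraction. This is a minimal illustration, not the study's actual procedure; the field names below are hypothetical.

```python
# Hedged sketch: computing an agreement rate between GPT-extracted fields
# and a human reviewer's extraction for one study subsection.
# Field names ("n", "mean_age", "female_pct") are hypothetical examples,
# not taken from the study's extraction template.

def agreement_rate(gpt_fields: dict, human_fields: dict) -> float:
    """Percentage of extraction fields on which GPT and human agree."""
    keys = human_fields.keys()
    matches = sum(1 for k in keys if gpt_fields.get(k) == human_fields[k])
    return 100.0 * matches / len(keys)

# Hypothetical participant-characteristics subsection: 2 of 3 fields match.
human = {"n": 120, "mean_age": 34.5, "female_pct": 52}
gpt = {"n": 120, "mean_age": 34.5, "female_pct": 50}
print(round(agreement_rate(gpt, human), 2))
```

Averaging such rates across studies would yield the subsection-level figures reported in the results.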
The mean time for human data extraction was 36 minutes per study, compared to 26.6 seconds for GPT generation, followed by 13 minutes of human review. The GPT demonstrated high overall agreement rates with human reviewers, achieving 91.45% for review 1 and 89.31% for review 2. It was particularly accurate in extracting study characteristics (review 1: 95.25%; review 2: 90.83%) and participant characteristics (review 1: 95.03%; review 2: 90.00%), with lower performance observed in more complex areas such as methodological characteristics (87.07%) and statistical results (77.50%). The GPT correctly extracted data in 14 instances (3.25% in review 1) and 4 instances (1.16% in review 2) in which the human reviewer was incorrect.
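The time figures above imply a large reduction in per-study effort even after accounting for human checking of the GPT output. A quick arithmetic check, using only the values reported in the abstract:

```python
# Hedged arithmetic check of the reported time savings.
# Values come from the abstract; the interpretation that the GPT workflow
# equals generation time plus human review time is an assumption.
human_extraction_min = 36.0
gpt_generation_min = 26.6 / 60  # 26.6 seconds expressed in minutes
human_review_min = 13.0

gpt_workflow_min = gpt_generation_min + human_review_min
reduction_pct = 100 * (human_extraction_min - gpt_workflow_min) / human_extraction_min
print(f"GPT workflow: {gpt_workflow_min:.1f} min per study")
print(f"Time reduction vs. manual extraction: {reduction_pct:.0f}%")
```

Under that assumption, the GPT-assisted workflow takes roughly 13.4 minutes per study, about a 63% reduction relative to fully manual extraction.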
The custom GPT substantially reduced extraction time and extracted data with high accuracy, particularly for participant and study characteristics. This tool may offer a viable option for researchers seeking to reduce resource demands during the extraction phase, although more research is needed to evaluate test-retest reliability, performance across broader review types, and accuracy in extracting statistical data. The tool developed in the current study has been made open access.