Mahmoudi Hesam, Chang Doris, Lee Hannah, Ghaffarzadegan Navid, Jalali Mohammad S
MGH Institute for Technology Assessment, Harvard Medical School, 125 Nashua St, Boston, MA 02114, United States.
Industrial and Systems Engineering Department, Virginia Tech, Falls Church, VA, United States.
JMIR AI. 2025 Sep 11;4:e68097. doi: 10.2196/68097.
Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor-intensive, as they require manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.
Our study aimed to explore the capability of LLMs, using ChatGPT (GPT-4), to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluation.
We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently coded 60 data elements manually and compared them with ChatGPT's responses to 420 queries across 7 prompt iterations.
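The comparison step described above, scoring an LLM's extracted answers against manually coded "gold" values, can be sketched as follows. The measures, the example values, and the matching rule (whitespace- and case-insensitive exact match) are illustrative assumptions, not the study's actual coding protocol.

```python
# Hedged sketch: count how many extracted data elements an LLM got right,
# given manual coding as the reference. Names and values are hypothetical.

def normalize(ans: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences don't count as errors."""
    return " ".join(ans.lower().split())

def score(gold: dict, llm: dict) -> int:
    """Count data elements where the LLM answer matches the manual coding."""
    return sum(normalize(llm.get(key, "")) == normalize(value)
               for key, value in gold.items())

# Illustrative example: one of two elements matches.
gold = {"location": "United States", "approach": "SEIR compartmental model"}
llm = {"location": "united states", "approach": "agent-based model"}
print(score(gold, llm))  # prints 1
```

A real pipeline would also need per-measure adjudication rules (eg, for multi-valued fields such as analyzed interventions), which exact-match comparison does not capture.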
ChatGPT's accuracy improved as prompts were refined, showing improvements of 33% and 23% between the initial and final iterations for extracting study settings and behavioral components, respectively. In the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. However, in the final iteration, ChatGPT extracted 43 (71.7%) of the 60 data elements, showing better performance in extracting explicitly stated study settings (28/30, 93.3%) than in extracting subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations.
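The accuracy figures reported above follow directly from the stated counts. A minimal sketch of that arithmetic, using only the counts given in the abstract:

```python
# Accuracy percentages from the reported correct/total counts.

def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

initial_overall = accuracy(26, 60)    # 43.3% correct with initial prompts
final_overall = accuracy(43, 60)      # 71.7% correct in the final iteration
final_settings = accuracy(28, 30)     # 93.3% on explicit study settings
final_behavioral = accuracy(15, 30)   # 50.0% on subjective behavioral components
```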
Our findings underscore the utility of LLMs in extracting basic, explicitly stated data in SLRs when effective prompts are used. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.