Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic, 200 1st Street SW, Rochester, MN, 55905, USA.
Division of Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, MN, USA.
BMC Med Res Methodol. 2024 Nov 4;24(1):266. doi: 10.1186/s12874-024-02372-6.
Assessing the methodological quality of case reports and case series is challenging because of variability in human judgment and time constraints. We evaluated the agreement between the judgments of human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.
We searched Scopus for systematic reviews published in 2023-2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the tool's 8 signaling questions. The observed agreement and the agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.
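For illustration only, the following minimal Python sketch shows how such a GPT-4-based appraisal agent could be set up with the OpenAI chat completions API. The prompt wording, model identifier, temperature setting, and output format are assumptions, not the study's actual configuration, and the signaling questions are paraphrased; consult the original Murad et al. publication for their exact wording.

# Minimal sketch of a GPT-4-based appraisal agent for the Murad tool.
# Prompt wording, model name, and output format are illustrative assumptions;
# the abstract does not describe the study's actual agent or settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrased signaling questions (selection, ascertainment, causality, reporting).
SIGNALING_QUESTIONS = [
    "1. Do the cases represent the whole experience of the investigator/center, or could selection be biased?",
    "2. Was the exposure adequately ascertained?",
    "3. Was the outcome adequately ascertained?",
    "4. Were alternative causes that may explain the observation ruled out?",
    "5. Was there a challenge/rechallenge phenomenon?",
    "6. Was there a dose-response effect?",
    "7. Was follow-up long enough for outcomes to occur?",
    "8. Is the case described in sufficient detail to allow replication or inference for practice?",
]

def appraise(article_text: str) -> str:
    """Ask the model to answer each signaling question with yes/no and a brief rationale."""
    prompt = (
        "Assess the methodological quality of the following case report or case series "
        "using these signaling questions. Answer each with 'yes' or 'no' and one sentence of rationale.\n\n"
        + "\n".join(SIGNALING_QUESTIONS)
        + "\n\nArticle:\n" + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation for reproducibility checks
    )
    return response.choices[0].message.content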
We included 797 case reports and series. The observed agreement across the eight questions ranged from 41.91% to 80.93% (agreement coefficient: 25.39% to 79.72%). The lowest agreement was noted for the first signaling question, about selection of cases. Agreement was similar for articles published in journals with an impact factor < 5 vs. ≥ 5, and when systematic reviews that did not use the 3 causality questions were excluded. Repeating the analysis with the same prompts showed high agreement between the two GPT-4 attempts, except for the first question about selection of cases.
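As a worked illustration of the agreement statistics reported above, the Python sketch below computes the observed (percent) agreement and a chance-corrected coefficient for one signaling question rated yes/no by a human reviewer and by GPT-4. Gwet's AC1 is used here as an assumption, since the abstract does not name the specific agreement coefficient; the toy data are not from the study.

# Observed agreement and Gwet's AC1 (assumed coefficient) for two raters.
def observed_agreement(human, gpt):
    """Proportion of items on which the two raters gave the same judgment."""
    return sum(h == g for h, g in zip(human, gpt)) / len(human)

def gwet_ac1(human, gpt, categories=("yes", "no")):
    """Chance-corrected agreement: AC1 = (pa - pe) / (1 - pe)."""
    n = len(human)
    q = len(categories)
    pa = observed_agreement(human, gpt)
    # Chance agreement from the average marginal proportion of each category.
    pe = 0.0
    for c in categories:
        pi = (sum(h == c for h in human) + sum(g == c for g in gpt)) / (2 * n)
        pe += pi * (1 - pi) / (q - 1)
    return (pa - pe) / (1 - pe)

# Toy example (not study data):
human = ["yes", "yes", "no", "yes", "no", "yes"]
gpt = ["yes", "no", "no", "yes", "no", "yes"]
print(f"observed agreement = {observed_agreement(human, gpt):.2%}")
print(f"Gwet's AC1 = {gwet_ac1(human, gpt):.2f}")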
This study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case reports and series with the Murad tool. The current performance of GPT-4 seems promising but is unlikely to meet the rigor required for a systematic review; pairing the model with a human reviewer is still needed.