Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool.

Affiliations

Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic, 200 1st Street SW, Rochester, MN, 55905, USA.

Division of Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, MN, USA.

Publication information

BMC Med Res Methodol. 2024 Nov 4;24(1):266. doi: 10.1186/s12874-024-02372-6.

Abstract

BACKGROUND

Assessing the methodological quality of case reports and case series is challenging due to human judgment variability and time constraints. We evaluated the agreement in judgments between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.

METHODS

We searched Scopus for systematic reviews published in 2023-2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the tool's 8 signaling questions. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.
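The abstract does not describe how the GPT-4 agent was implemented. The sketch below is a minimal illustration, not the authors' actual pipeline: it assumes the agent sends the article text together with the 8 signaling questions (paraphrased here from the Murad et al. tool) to the OpenAI chat completions API and requests a yes/no/unclear answer per question.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The 8 signaling questions of the Murad et al. tool, paraphrased;
# see BMJ Evid Based Med. 2018;23(2):60-63 for the exact wording.
SIGNALING_QUESTIONS = [
    "1. Do the cases represent the whole experience of the investigator or centre?",
    "2. Was the exposure adequately ascertained?",
    "3. Was the outcome adequately ascertained?",
    "4. Were alternative causes that could explain the observation ruled out?",
    "5. Was there a challenge/rechallenge phenomenon?",
    "6. Was there a dose-response effect?",
    "7. Was follow-up long enough for outcomes to occur?",
    "8. Are the cases described in sufficient detail?",
]

def appraise(article_text: str) -> str:
    """Ask GPT-4 to answer each signaling question with yes, no, or unclear."""
    prompt = (
        "You are appraising the methodological quality of a case report or "
        "case series using the Murad et al. tool. Answer each signaling "
        "question below with only 'yes', 'no', or 'unclear', one per line.\n\n"
        + "\n".join(SIGNALING_QUESTIONS)
        + "\n\nArticle text:\n"
        + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output as reproducible as possible
    )
    return response.choices[0].message.content
```

The authors' agent may differ in prompt wording, model version, and in how the answers are parsed and scored.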

RESULTS

We included 797 case reports and case series. Observed agreement across the eight questions ranged from 41.91% to 80.93%, and the agreement coefficient ranged from 25.39% to 79.72%. Agreement was lowest on the first signaling question, about the selection of cases. Agreement was similar for articles published in journals with an impact factor < 5 vs. ≥ 5, and when systematic reviews that did not use the 3 causality questions were excluded. Repeating the analysis with the same prompts showed high agreement between the two GPT-4 attempts, except on the first question about case selection.
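The abstract does not name the specific agreement coefficient that was used. As an illustration only, the sketch below computes observed agreement and Gwet's AC1, one commonly used chance-corrected agreement statistic, for a single signaling question; the toy judgment data are hypothetical.

```python
from collections import Counter

def observed_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters give the same judgment."""
    assert len(rater_a) == len(rater_b)
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def gwet_ac1(rater_a, rater_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters
    and categorical ratings (e.g. yes / no / unclear)."""
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    k = len(categories)
    pa = observed_agreement(rater_a, rater_b)
    # Average marginal proportion of each category across both raters.
    counts = Counter(rater_a) + Counter(rater_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    # Chance agreement under Gwet's model.
    pe = sum(pi[c] * (1 - pi[c]) for c in categories) / (k - 1)
    return (pa - pe) / (1 - pe)

# Hypothetical human vs. GPT-4 judgments on one signaling question.
human = ["yes", "yes", "no", "unclear", "yes", "no"]
gpt4  = ["yes", "no",  "no", "yes",     "yes", "no"]
print(f"observed agreement: {observed_agreement(human, gpt4):.2%}")
print(f"Gwet's AC1: {gwet_ac1(human, gpt4):.3f}")
```

AC1 is often preferred over Cohen's kappa when one answer category dominates, because kappa can be paradoxically low even when observed agreement is high.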

CONCLUSIONS

The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case reports and case series using the Murad tool. The current performance of GPT-4 appears promising but is unlikely to be sufficient for the rigor of a systematic review; pairing the model with a human reviewer is required.

Similar articles

Benchmarking Human-AI collaboration for common evidence appraisal tools. J Clin Epidemiol. 2024 Nov;175:111533. doi: 10.1016/j.jclinepi.2024.111533. Epub 2024 Sep 12.

References cited in this article

Case study research and causal inference. BMC Med Res Methodol. 2022 Dec 1;22(1):307. doi: 10.1186/s12874-022-01790-8.

Kappa and Beyond: Is There Agreement? Global Spine J. 2020 Jun;10(4):499-501. doi: 10.1177/2192568220911648. Epub 2020 Mar 3.

Methodological quality and synthesis of case series and case reports. BMJ Evid Based Med. 2018 Apr;23(2):60-63. doi: 10.1136/bmjebm-2017-110853. Epub 2018 Feb 2.
