


Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool.

Affiliations

Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic, 200 1st Street SW, Rochester, MN, 55905, USA.

Division of Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, MN, USA.

Publication information

BMC Med Res Methodol. 2024 Nov 4;24(1):266. doi: 10.1186/s12874-024-02372-6.

DOI:10.1186/s12874-024-02372-6
PMID:39497032
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11533388/
Abstract

BACKGROUND

Assessing the methodological quality of case reports and case series is challenging due to human judgment variability and time constraints. We evaluated the agreement in judgments between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.

METHODS

We searched Scopus for systematic reviews published in 2023-2024 that cited the appraisal tool by Murad et al. A GPT-4 based agent was developed to assess methodological quality using the tool's 8 signaling questions. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers to the GPT-4 assessments.
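The article does not include its analysis code. As a rough sketch of the comparison described above, the snippet below computes observed agreement and one common chance-corrected statistic (Cohen's kappa; the paper's "agreement coefficient" is not named and may be a different estimator, such as Gwet's AC1). The `human` and `gpt4` judgment lists are hypothetical.

```python
from collections import Counter

def observed_agreement(human, model):
    """Fraction of items where the two raters give the same judgment."""
    assert len(human) == len(model)
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohens_kappa(human, model):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(human)
    po = observed_agreement(human, model)
    # Expected agreement if the two raters judged independently,
    # each following their own marginal distribution of answers.
    ch, cm = Counter(human), Counter(model)
    pe = sum(ch[c] * cm[c] for c in set(human) | set(model)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical yes/no judgments on one signaling question, per article.
human = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
gpt4  = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]

print(f"observed agreement: {observed_agreement(human, gpt4):.2%}")  # 75.00%
print(f"kappa: {cohens_kappa(human, gpt4):.3f}")
```

The gap between the two numbers illustrates why the paper reports both: raw percent agreement can look high even when much of it is expected by chance.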

RESULTS

We included 797 case reports and series. Observed agreement ranged between 41.91% and 80.93% across the eight questions (agreement coefficient ranged from 25.39% to 79.72%). The lowest agreement was noted in the first signaling question, about selection of cases. Agreement was similar for articles published in journals with impact factor < 5 vs. ≥ 5, and when excluding systematic reviews that did not use the 3 causality questions. Repeating the analysis with the same prompts demonstrated high agreement between the two GPT-4 attempts, except for the first question about selection of cases.

CONCLUSIONS

The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case series and reports using the Murad tool. The current performance of GPT-4 seems promising but is unlikely to be sufficient for the rigor of a systematic review, so pairing the model with a human reviewer is required.
