Ramchandani Rashi, Guo Eddie, Rakab Esra, Rathod Jharna, Strain Jamie, Klement William, Shorr Risa, Williams Erin, Jones Daniel, Gilbert Sebastien
Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada.
Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
PeerJ Comput Sci. 2025 Apr 30;11:e2822. doi: 10.7717/peerj-cs.2822. eCollection 2025.
Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To test the performance of the model, we also evaluated GPT-4 against a narrower inclusion criterion, assessing its ability to discriminate relevant articles that identified solely preoperative risk factors for esophagectomy.
A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and by GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4 with the screening criteria expressed in natural language. GPT-4's inclusion and exclusion decisions were compared to those made by the human reviewers.
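The abstract does not include the authors' script; the following is a minimal sketch of how such API-based title and abstract screening could be implemented with the openai Python SDK. The model name, prompt wording, and the `screen_article` helper are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' script): screening one title/abstract
# against natural-language inclusion criteria via the OpenAI API.
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = (
    "Include the study only if its title or abstract discusses perioperative "
    "risk factors for complications of esophagectomy. Answer with INCLUDE or "
    "EXCLUDE, followed by a one-sentence justification."
)

def screen_article(title: str, abstract: str) -> tuple[bool, str]:
    """Return (include_decision, justification) for a single record."""
    response = client.chat.completions.create(
        model="gpt-4",   # illustrative; the deployed model name may differ
        temperature=0,   # keep decisions as deterministic as possible
        messages=[
            {"role": "system", "content": CRITERIA},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    answer = response.choices[0].message.content.strip()
    return answer.upper().startswith("INCLUDE"), answer

# Example usage:
# decision, justification = screen_article(record["title"], record["abstract"])
```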
The agreement between the GPT model and the human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies of 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under the more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, the inclusion and exclusion justifications provided by the GPT model were reported to be useful by study screeners, especially for resolving discrepancies during title and abstract screening.
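To make the reported metrics concrete, the sketch below shows how agreement, recall, positive and negative predictive value, false positive rate, macro-F1, and Cohen's kappa can be derived from paired human and model decisions using scikit-learn. The arrays and variable names are toy placeholders, not the study data.

```python
# Illustrative computation of the reported screening metrics from paired
# human (reference) and GPT (predicted) include/exclude decisions.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

# 1 = include, 0 = exclude (toy example arrays, not the study data)
human = np.array([1, 0, 1, 1, 0, 0, 1, 0])
gpt   = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(human, gpt).ravel()

agreement = (tp + tn) / (tp + tn + fp + fn)      # percent agreement with humans
recall    = tp / (tp + fn)                       # recall for included studies
ppv       = tp / (tp + fp)                       # positive predictive value
npv       = tn / (tn + fn)                       # negative predictive value
fpr       = fp / (fp + tn)                       # false positive rate
macro_f1  = f1_score(human, gpt, average="macro")
kappa     = cohen_kappa_score(human, gpt)        # chance-corrected agreement

print(f"agreement={agreement:.2%}, recall={recall:.2%}, PPV={ppv:.2%}, "
      f"NPV={npv:.2%}, FPR={fpr:.2%}, macro-F1={macro_f1:.2f}, kappa={kappa:.2f}")
```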
This study demonstrates a promising use of LLMs to streamline the systematic review workflow. The integration of LLMs in systematic reviews could lead to significant time and cost savings; however, caution must be taken for reviews involving more stringent or narrower inclusion and exclusion criteria. Future research should explore integrating LLMs into other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.