Department of Clinical Laboratory, National Center Hospital, National Center of Neurology and Psychiatry, Kodaira, Japan.
Department of Sleep-Wake Disorders, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira, Japan.
J Med Internet Res. 2024 Aug 16;26:e52758. doi: 10.2196/52758.
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline title and abstract screening for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records.
We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included.
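The layered decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Judge` callable stands in for a per-layer GPT call, and `keyword_judge` with its keywords is a hypothetical toy replacement so the example runs offline. The sketch also assumes a record rejected at one layer need not be evaluated at later layers, which the abstract does not specify.

```python
from typing import Callable, Dict

# The three layers named in the abstract.
LAYERS = ["research design", "target patients", "interventions and controls"]

# Hypothetical interface: (layer name, record text) -> meets criteria at that layer?
Judge = Callable[[str, str], bool]


def screen_record(text: str, judge: Judge) -> bool:
    """Include a record only if it meets the criteria at every layer.

    all() stops at the first failing layer (an assumption of this sketch).
    """
    return all(judge(layer, text) for layer in LAYERS)


def screen_all(records: Dict[str, str], judge: Judge) -> Dict[str, bool]:
    """Map each record ID to its include/exclude decision."""
    return {rid: screen_record(text, judge) for rid, text in records.items()}


def keyword_judge(layer: str, text: str) -> bool:
    """Toy stand-in for a model call: one made-up keyword per layer."""
    keywords = {
        "research design": "randomized",
        "target patients": "bipolar",
        "interventions and controls": "lithium",
    }
    return keywords[layer] in text.lower()


if __name__ == "__main__":
    demo = {
        "r1": "A randomized trial of lithium versus placebo in bipolar disorder",
        "r2": "A case report of lithium use in bipolar disorder",
        "r3": "A randomized trial of lithium in schizophrenia",
    }
    print(screen_all(demo, keyword_judge))
```

In practice the judge would wrap a prompt tailored to the review's inclusion criteria for that layer; only the conjunction across layers is taken from the text.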
On each layer, both GPT-3.5 and GPT-4 processed about 110 records per minute, and the total time required to screen the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both the GPT-3.5 and GPT-4 screenings judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both the GPT-3.5 and GPT-4 screenings judged all 9 records used for the meta-analysis as included. After accounting for records that GPT-4 justifiably excluded, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the records incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the records incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
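For readers less familiar with the metrics reported above, the following minimal sketch shows how sensitivity and specificity are computed from screening decisions and reference labels. The function name and the example labels are illustrative assumptions; they are not the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for include/exclude decisions.

    predicted[i] is True if the screener included record i;
    actual[i] is True if record i is truly relevant.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = sum(p and a for p, a in zip(predicted, actual))          # relevant, included
    fn = sum((not p) and a for p, a in zip(predicted, actual))    # relevant, excluded
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))  # irrelevant, excluded
    fp = sum(p and (not a) for p, a in zip(predicted, actual))    # irrelevant, included

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one relevant and one irrelevant record")

    sensitivity = tp / (tp + fn)  # share of relevant records kept
    specificity = tn / (tn + fp)  # share of irrelevant records removed
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up labels for seven records, purely for illustration.
    predicted = [True, True, False, False, False, False, True]
    actual = [True, True, True, False, False, False, False]
    print(sensitivity_specificity(predicted, actual))
```

A screener tuned for high sensitivity, as in this study, accepts more false inclusions (lower specificity) in exchange for missing fewer relevant records.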
Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity, which supports its practical application in systematic review screening. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.