
High-performance automated abstract screening with large language model ensembles.

Authors

Sanghera Rohan, Thirunavukarasu Arun James, El Khoury Marc, O'Logbon Jessica, Chen Yuqing, Watt Archie, Mahmood Mustafa, Butt Hamid, Nishimura George, Soltan Andrew A S

Affiliations

Oxford University Hospitals NHS Foundation Trust, Oxford OX3 9DU, United Kingdom.

Oxford University Clinical Academic Graduate School, Medical Sciences Division, University of Oxford, Oxford OX3 9DU, United Kingdom.

Publication

J Am Med Inform Assoc. 2025 May 1;32(5):893-904. doi: 10.1093/jamia/ocaf050.

Abstract

OBJECTIVE

Abstract screening is a labor-intensive component of systematic review, involving repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.

MATERIALS AND METHODS

LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).
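The zero-shot binary classification set-up described above can be sketched as follows. The prompt wording, the `INCLUDE`/`EXCLUDE` verdict format, and the parsing heuristic are illustrative assumptions, not the authors' validated prompting strategies; any of the trialled LLMs would stand in for the `llm` callable.

```python
from typing import Callable

def build_prompt(criteria: str, abstract: str) -> str:
    """Compose a zero-shot screening prompt: review criteria plus the
    abstract, with a request for a single-word verdict. Hypothetical
    wording for illustration only."""
    return (
        "You are screening abstracts for a systematic review.\n"
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def parse_decision(response: str) -> bool:
    """Map the model's free-text reply to a binary decision; default
    to include (True) so no record is silently dropped."""
    return "EXCLUDE" not in response.upper()

def screen(abstract: str, criteria: str,
           llm: Callable[[str], str]) -> bool:
    """Run one record through one LLM screener (True = include)."""
    return parse_decision(llm(build_prompt(criteria, abstract)))
```

Defaulting ambiguous replies to inclusion errs toward sensitivity, which matters more than precision at the abstract-screening stage because excluded records are never seen again.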

RESULTS

On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity, with a maximal precision of 0.458 on the development dataset, decreasing to 0.145 on the comprehensive dataset, while conferring workload reductions ranging between 37.55% and 99.11%.
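The drop in precision without any drop in sensitivity is a direct arithmetic consequence of class imbalance, as a minimal sketch with made-up confusion-matrix counts (not the study's data) shows: holding sensitivity and specificity fixed, shrinking the prevalence of relevant records multiplies false positives relative to true positives.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, precision, and balanced accuracy from a
    screening confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Balanced development-style split (50% prevalence): precision is high.
balanced = screening_metrics(tp=380, fp=30, tn=370, fn=20)

# Imbalanced comprehensive-style split (<0.1% prevalence): the same
# per-record error rates now produce far more false positives per true
# positive, so precision collapses while sensitivity is unchanged.
imbalanced = screening_metrics(tp=95, fp=7000, tn=112600, fn=5)
```

Here both splits have sensitivity 0.95, but precision falls from roughly 0.93 to roughly 0.013, mirroring the qualitative pattern reported above.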

DISCUSSION

Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.
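One common way such an ensemble can be wired, sketched here as an assumption rather than the authors' exact rule, is disjunctive (OR) voting: a record is excluded only when every screener excludes it, which biases the ensemble toward sensitivity, and workload reduction is the fraction of records the ensemble removes from downstream review.

```python
from typing import Callable, Iterable, List

# A screener maps a record to a binary decision (True = include);
# it could be a human reviewer or an LLM-prompt combination.
Screener = Callable[[str], bool]

def or_ensemble(screeners: List[Screener], record: str) -> bool:
    """Include a record if ANY screener votes include; exclusion
    requires unanimity, which favours sensitivity."""
    return any(s(record) for s in screeners)

def workload_reduction(records: Iterable[str],
                       screeners: List[Screener]) -> float:
    """Fraction of records the ensemble excludes, i.e. abstracts a
    downstream reviewer no longer needs to read."""
    records = list(records)
    excluded = sum(1 for r in records
                   if not or_ensemble(screeners, r))
    return excluded / len(records)
```

In an LLM-human pairing under this rule, the human still sees every record, so oversight is preserved while the ensemble's unanimous exclusions shrink the set carried forward to full-text review.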

CONCLUSION

LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5eeb/12012331/495321d03c0a/ocaf050f1.jpg
