Iyer Radhika, Christie Alec Philip, Madhavapeddy Anil, Reynolds Sam, Sutherland William, Jaffer Sadiq
Department of Zoology, University of Cambridge, Cambridge, United Kingdom.
Centre for Environmental Policy, Imperial College London, United Kingdom.
PLoS One. 2025 May 15;20(5):e0323563. doi: 10.1371/journal.pone.0323563. eCollection 2025.
Wise use of evidence to support efficient conservation action is key to tackling biodiversity loss with limited time and resources. Evidence syntheses provide key recommendations for conservation decision-makers by assessing and summarising evidence, but are not always easy to access, digest, and use. Recent advances in Large Language Models (LLMs) present both opportunities and risks in enabling faster and more intuitive systems to access evidence syntheses and databases. Such systems for natural language search and open-ended evidence-based responses are pipelines comprising many components. The most critical of these components are the LLM used and how evidence is retrieved from the database. We evaluated the performance of ten LLMs across six different database retrieval strategies against human experts in answering synthetic multiple-choice question exams on the effects of conservation interventions, using the Conservation Evidence database. We found that LLM performance was comparable with that of human experts over 45 filtered questions, both in correctly answering them and in retrieving the document used to generate them. Across 1867 unfiltered questions, LLM performance demonstrated a level of conservation-specific knowledge, but this varied across topic areas. A hybrid retrieval strategy that combines keywords and vector embeddings performed best by a substantial margin. We also tested against a previous-generation LLM that was state-of-the-art at its release, and it was outperformed by all ten current models, including smaller, cheaper models. Our findings suggest that, with careful domain-specific design, LLMs could be powerful tools for enabling expert-level use of evidence syntheses and databases in different disciplines. However, general LLMs used 'out-of-the-box' are likely to perform poorly and misinform decision-makers. By establishing that LLMs exhibit performance comparable with human synthesis experts when providing restricted responses to queries of evidence syntheses and databases, our approach gives future work a basis for quantifying LLM performance in providing open-ended responses.
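The abstract does not specify how the best-performing hybrid retrieval strategy was implemented. As an illustration only, the minimal Python sketch below shows one common way to combine keyword and vector-embedding retrieval: BM25 keyword scores and dense cosine similarities are fused with reciprocal rank fusion. The libraries (rank_bm25, sentence_transformers), model name, example documents, and the hybrid_search function are all assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch only: hybrid retrieval fusing keyword (BM25) and
# dense-embedding rankings via reciprocal rank fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi                         # assumed keyword scorer
from sentence_transformers import SentenceTransformer   # assumed embedding model

# Toy stand-ins for evidence-synthesis summaries (hypothetical content).
documents = [
    "Installing nest boxes increased occupancy by target bird species.",
    "Predator-exclusion fencing reduced nest predation on shorebirds.",
    "Hedgerow planting had mixed effects on farmland invertebrates.",
]

# Keyword index over whitespace-tokenised documents.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])

# Dense index: one normalised embedding vector per document.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(documents, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 2, rrf_k: int = 60) -> list[str]:
    """Rank documents by fusing BM25 and cosine-similarity rankings."""
    bm25_scores = bm25.get_scores(query.lower().split())
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    dense_scores = doc_emb @ q_emb  # cosine similarity (vectors are normalised)

    # Convert each score array to a rank ordering (best first).
    bm25_rank = np.argsort(-bm25_scores)
    dense_rank = np.argsort(-dense_scores)

    # Reciprocal rank fusion: sum 1 / (rrf_k + rank) across both rankings.
    fused = np.zeros(len(documents))
    for ranking in (bm25_rank, dense_rank):
        for rank, doc_idx in enumerate(ranking):
            fused[doc_idx] += 1.0 / (rrf_k + rank + 1)

    return [documents[i] for i in np.argsort(-fused)[:k]]

print(hybrid_search("Do nest boxes help birds?"))
```

The appeal of this kind of fusion, and a plausible reason a hybrid strategy could outperform either component alone, is that keyword scoring preserves exact intervention and species terms while embeddings capture paraphrased queries; RRF combines the two without needing their raw scores to be on the same scale.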