Fergus Paul, Chalmers Carl, Matthews Naomi, Nixon Stuart, Burger André, Hartley Oliver, Sutherland Chris, Lambin Xavier, Longmore Steven, Wich Serge
School of Computer Science and Mathematics, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, UK.
Chester Zoo, Upton-by-Chester, Chester CH2 1EU, UK.
Sensors (Basel). 2024 Dec 19;24(24):8122. doi: 10.3390/s24248122.
Camera traps offer enormous new opportunities in ecological studies, but current automated image analysis methods often lack the contextual richness needed to support impactful conservation outcomes. Integrating vision-language models into these workflows could address this gap by providing enhanced contextual understanding and enabling advanced queries across temporal and spatial dimensions. Here, we present an integrated approach that combines deep learning-based vision and language models to improve ecological reporting using data from camera traps. We introduce a two-stage system: YOLOv10-X to localise and classify species (mammals and birds) within images, and a Phi-3.5-vision-instruct model that reads the YOLOv10-X bounding-box labels to identify species, overcoming the vision-language model's difficulty with hard-to-classify objects in images. Additionally, Phi-3.5 detects broader variables, such as vegetation type and time of day, providing rich ecological and environmental context for YOLO's species detection output. This combined output is processed by the model's natural language system to answer complex queries, and retrieval-augmented generation (RAG) is employed to enrich responses with external information, such as species weight and IUCN status (information that cannot be obtained through direct visual analysis). Together, this information is used to automatically generate structured reports, providing biodiversity stakeholders with deeper insights into, for example, species abundance, distribution, animal behaviour, and habitat selection. Our approach delivers contextually rich narratives that aid in wildlife management decisions. By providing contextually rich insights, our approach not only reduces manual effort but also supports timely decision making in conservation, potentially shifting efforts from reactive to proactive.
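To make the two-stage pipeline described above concrete, the minimal Python sketch below shows one way the detection and vision-language stages could be chained: a YOLOv10-X checkpoint (via the Ultralytics API) localises species, its bounding-box labels are folded into a prompt for a vision-language model, and retrieved facts (e.g. IUCN status) are attached. This is an illustrative sketch, not the authors' implementation; `query_vlm`, `lookup_rag`, and the `yolov10x.pt` weights file are assumed placeholders.

```python
# Illustrative sketch (not the authors' code) of the two-stage pipeline:
# 1) YOLOv10-X localises and classifies species,
# 2) a vision-language model is prompted with the image and the detected
#    labels to add ecological context, and
# 3) retrieval-augmented lookup supplies facts not visible in the image.
from ultralytics import YOLO


def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for the Phi-3.5-vision-instruct call described in the paper."""
    raise NotImplementedError("wire up a vision-language model here")


def lookup_rag(species: str) -> dict:
    """Placeholder for retrieval of external facts, e.g. weight and IUCN status."""
    raise NotImplementedError("wire up a retrieval index here")


def analyse_camera_trap_image(image_path: str) -> dict:
    # Stage 1: localise and classify species with a YOLOv10-X checkpoint
    # (assumed local weights file "yolov10x.pt").
    detector = YOLO("yolov10x.pt")
    result = detector(image_path)[0]
    detections = [
        {"species": result.names[int(box.cls)], "conf": float(box.conf)}
        for box in result.boxes
    ]

    # Stage 2: pass the image and the detected labels to the VLM so it can
    # confirm species and describe context (vegetation type, time of day).
    prompt = (
        "Detected species: "
        + ", ".join(d["species"] for d in detections)
        + ". Describe the habitat, vegetation type and time of day."
    )
    context = query_vlm(image_path, prompt)

    # Enrich with retrieved facts that cannot be read from the image itself.
    facts = [lookup_rag(d["species"]) for d in detections]

    return {"detections": detections, "context": context, "facts": facts}
```

Structured outputs of this kind could then be aggregated across a camera-trap deployment to feed the report-generation step described in the abstract.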