Creating interpretable deep learning models to identify species using environmental DNA sequences.

Author information

Waggoner Samuel, Donnelly Jon, Gurung Rose, Jackson Laura, Chen Chaofan

Affiliations

School of Computing and Information Science, University of Maine, Orono, 04469, USA.

Department of Computer Science, Duke University, Durham, 27708, USA.

Publication information

Sci Rep. 2025 Jul 28;15(1):27436. doi: 10.1038/s41598-025-09846-7.

DOI:10.1038/s41598-025-09846-7

PMID:40721613

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12304181/

Abstract

Monitoring species' presence in an ecosystem is crucial for conservation and understanding habitat diversity, but can be expensive and time consuming. As a result, ecologists have begun using the DNA that animals naturally leave behind in water or soil (called environmental DNA, or eDNA) to identify the species present in an environment. Recent work has shown that when used to identify species, convolutional neural networks (CNNs) can be as much as 150 times faster than ObiTools, a traditional method that does not use deep learning. However, CNNs are black boxes, meaning it is impossible to "fact check" why they predict that a given sequence belongs to a particular species. In this work, we introduce an interpretable, prototype-based CNN using the ProtoPNet framework that surpasses previous accuracy on a challenging eDNA dataset. The network is able to visualize the sequences of bases that are most distinctive for each species in the dataset, and introduces a novel skip connection that improves the interpretability of the original ProtoPNet. Our results show that reducing reliance on the convolutional output increases both interpretability and accuracy.
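To make the prototype idea concrete, the sketch below slides a short base pattern over a one-hot encoded read and scores each window with the distance-to-similarity mapping used in the original ProtoPNet formulation, log((d² + 1)/(d² + ε)). It is only an illustration under stated assumptions: the function names, the value of `eps`, and the choice to compare against the raw one-hot input are hypothetical and are not taken from the paper, whose model learns its prototypes and also uses convolutional features.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors (A, C, G, T).

    Any other symbol (for example N) becomes an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def prototype_similarity(seq, prototype, eps=1e-4):
    """Return (best_score, best_position) of a prototype within a sequence.

    Hypothetical helper: the prototype is given here as a plain base string,
    whereas a trained network would learn it. Each window is scored with
    log((d2 + 1) / (d2 + eps)), where d2 is the squared L2 distance between
    the window and the prototype, so smaller distances give larger scores.
    """
    x = one_hot(seq)
    p = one_hot(prototype)
    k = len(p)
    best_score, best_pos = -math.inf, -1
    for start in range(len(x) - k + 1):
        d2 = sum(
            (x[start + i][j] - p[i][j]) ** 2
            for i in range(k)
            for j in range(len(BASES))
        )
        score = math.log((d2 + 1.0) / (d2 + eps))
        if score > best_score:
            best_score, best_pos = score, start
    return best_score, best_pos


if __name__ == "__main__":
    score, pos = prototype_similarity("ACGTTGCA", "TTG")
    print(f"best match at position {pos} with similarity {score:.3f}")
```

Reporting the best-matching position is what makes this kind of model inspectable: the window that drove the score can be shown to a reader as the stretch of bases the prediction relied on.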
