Vector Institute for Artificial Intelligence, Toronto, Canada M5G 0C6.
Department of Computer Science, University of Toronto, Toronto, Canada M5S 2E4.
Philos Trans R Soc Lond B Biol Sci. 2024 Jun 24;379(1904):20230124. doi: 10.1098/rstb.2023.0124. Epub 2024 May 6.
DNA-based identification is vital for classifying biological specimens, yet methods to quantify the uncertainty of sequence-based taxonomic assignments are scarce. Challenges arise from noisy reference databases, including mislabelled entries and missing taxa. PROTAX addresses these issues with a probabilistic approach to taxonomic classification, advancing on methods that rely solely on sequence similarity. It provides calibrated probabilistic assignments to a partially populated taxonomic hierarchy, accounting both for taxa that lack reference sequences and for incorrect taxonomic annotations. While PROTAX is effective on smaller scales, applying it globally requires substantially larger reference libraries, a goal previously hindered by computational barriers. We introduce PROTAX-GPU, a scalable algorithm capable of leveraging the global Barcode of Life Data System (>14 million specimens) as a reference database. Using graphics processing units (GPUs) to accelerate similarity and nearest-neighbour operations and the JAX library for Python integration, we achieve a speedup of more than 1000× compared with the central processing unit (CPU)-based implementation without compromising PROTAX's key benefits. PROTAX-GPU marks a significant stride towards real-time DNA barcoding, enabling quicker and more efficient species identification in environmental assessments. This capability opens up new avenues for real-time monitoring and analysis of biodiversity, advancing our ability to understand and respond to ecological dynamics. This article is part of the theme issue 'Towards a toolkit for global insect biodiversity monitoring'.
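To make the "similarity and nearest-neighbour operations" concrete, the following is a minimal JAX sketch of scoring one query barcode against a small batch of reference sequences and keeping the top matches. It is not the PROTAX-GPU implementation: the integer encoding, the function names (`encode`, `similarity`, `nearest_references`), the assumption of pre-aligned equal-length sequences, and the similarity definition (fraction of matching bases over positions that are unambiguous in both sequences) are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

# Hypothetical encoding (an assumption, not taken from PROTAX-GPU):
# A, C, G, T -> 0..3; any other symbol (N, gap, ...) -> 4.
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Turn a DNA string into a 1-D int32 array."""
    return jnp.array([_CODES.get(base, 4) for base in seq.upper()], dtype=jnp.int32)


@jax.jit
def similarity(query, refs):
    """Similarity of one query (shape [L]) to every reference (shape [N, L]).

    Positions where either sequence holds an ambiguous symbol are ignored.
    """
    valid = (query[None, :] < 4) & (refs < 4)
    matches = (query[None, :] == refs) & valid
    n_valid = jnp.sum(valid, axis=1)
    # Guard against division by zero when no position is comparable.
    return jnp.sum(matches, axis=1) / jnp.maximum(n_valid, 1)


def nearest_references(query, refs, k=2):
    """Return (similarities, indices) of the k most similar references."""
    return jax.lax.top_k(similarity(query, refs), k)


# Toy reference library; real libraries hold millions of aligned barcodes.
refs = jnp.stack([encode(s) for s in ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGNACGA"]])
values, indices = nearest_references(encode("ACGTACGT"), refs, k=2)
print(indices, values)
```

Because the comparison is expressed as whole-array operations, the same code runs on a CPU or a GPU without modification; in a probabilistic classifier of this kind, the resulting similarities would serve as inputs to the model rather than as the final assignment.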