de Medeiros Bruno A S, Cai Liming, Flynn Peter J, Yan Yujing, Duan Xiaoshan, Marinho Lucas C, Anderson Christiane, Davis Charles C
Field Museum of Natural History, Chicago, IL, USA.
Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA.
Nat Ecol Evol. 2025 Jun 25. doi: 10.1038/s41559-025-02752-1.
Species identification using DNA barcodes has revolutionized biodiversity sciences. However, conventional barcoding methods may lack power and universal applicability across the tree of life. Alternative methods based on whole genome sequencing are hard to scale due to large data requirements. Here we develop a novel DNA-based identification method, varKoding, using exceptionally low-coverage genome skim data to create two-dimensional images representing the genomic signature of a species. Using these representations, we train neural networks for taxonomic identification. Applying a taxonomically verified novel genomic dataset of Malpighiales plant accessions, we optimize training hyperparameters and find the highest performance by combining a transformer architecture with a new modified chaos game representation. Greater than 91% precision is achieved despite minimal input data, exceeding alternative methods tested. We illustrate the broad utility of varKoding across several focal clades of eukaryotes and prokaryotes. We also train a model capable of identifying all species in the Sequence Read Archive of the National Center for Biotechnology Information using less than 10 Mbp sequencing data with 96% precision and 95% recall and robust to sequencing platforms. The varKoding approach offers enhanced computational efficiency and scalability, minimal data inputs robust to sequencing details and modularity for further development in biodiversity science.
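To make the idea of turning low-coverage reads into a two-dimensional genomic signature concrete, here is a minimal sketch of a standard frequency chaos game representation (FCGR) built from k-mer counts. It is not the modified representation described in the abstract, whose details are not given here. The function names `fcgr_counts` and `to_grayscale`, the corner assignment for A, C, G and T, and the default k-mer length are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed corner convention (x_bit, y_bit); other assignments are equally valid.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr_counts(reads, k=4):
    """Count k-mers from reads into a 2^k x 2^k chaos game grid."""
    size = 2 ** k
    counts = np.zeros((size, size), dtype=np.int64)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            x = y = 0
            valid = True
            # In the chaos game the most recent base picks the coarsest
            # quadrant, so the last base must become the most significant
            # bit. Walking the k-mer in reverse makes the first base
            # processed (the last base of the k-mer) end up highest.
            for base in reversed(kmer):
                corner = CORNERS.get(base)
                if corner is None:  # skip k-mers with N or other symbols
                    valid = False
                    break
                x = (x << 1) | corner[0]
                y = (y << 1) | corner[1]
            if valid:
                counts[y, x] += 1
    return counts


def to_grayscale(counts):
    """Rescale counts linearly to an 8-bit grayscale image."""
    peak = counts.max()
    if peak == 0:
        return np.zeros(counts.shape, dtype=np.uint8)
    return np.round(counts / peak * 255).astype(np.uint8)


if __name__ == "__main__":
    toy_reads = ["ACGTACGTTGCA", "GGGTTTAAACCC"]  # toy data, not real reads
    image = to_grayscale(fcgr_counts(toy_reads, k=3))
    print(image.shape)
```

An image produced this way could then be passed to an image classifier. The abstract reports a transformer architecture for that step, which this sketch does not attempt to reproduce.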