FuzzyID2: A software package for large data set species identification via barcoding and metabarcoding using hidden Markov models and fuzzy set methods.

Affiliations

College of Life Sciences, Capital Normal University, Beijing, China.

State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China.

Publication information

Mol Ecol Resour. 2018 May;18(3):666-675. doi: 10.1111/1755-0998.12738. Epub 2017 Dec 10.

DOI:10.1111/1755-0998.12738

PMID:29154499

Abstract

Species identification through DNA barcoding or metabarcoding has become a key approach for biodiversity evaluation and ecological studies. However, the rapid accumulation of barcoding data has created some difficulties: for instance, global queries against a large reference library can take a very long time. Here we devise a two-step search strategy to speed up identification for such queries. It first uses a hidden Markov model (HMM) algorithm to narrow the search scope to genus level and then determines the corresponding species using minimum genetic distance. Moreover, using a fuzzy membership function, our approach also estimates the credibility of the assignment result for each query. To perform this task, we developed a new software pipeline, FuzzyID2, using Python and C++. Performance of the new method was assessed using eight empirical data sets ranging from 70 to 234,535 barcodes. Five data sets (four animal, one plant) deployed the conventional barcode approach, one used metabarcodes, and two were eDNA-based. The results showed mean accuracies of genus-level and species-level identification of 98.60% (with a minimum of 95.00% and a maximum of 100.00%) and 94.17% (with a range of 84.40%-100.00%), respectively. Tests with simulated NGS sequences based on realistic eDNA and metabarcode data demonstrated that FuzzyID2 achieved a significantly higher identification success rate than the commonly used Blast method, and that the TIPP method tends to find many fewer species than either FuzzyID2 or Blast. Furthermore, with data sets of tens of thousands of barcodes, FuzzyID2 needs only a few seconds for each query assignment. Our approach provides an efficient and accurate species identification protocol for biodiversity-related projects with large DNA sequence data sets.
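The second step of the strategy (species assignment by minimum genetic distance within the genus already selected) and the credibility estimate can be illustrated with a minimal Python sketch. This is not the FuzzyID2 implementation: the abstract does not specify the distance measure or the form of the membership function, so the uncorrected p-distance, the piecewise-linear membership, the thresholds `low` and `high`, and all function names below are assumptions chosen for illustration. The HMM-based genus step is not reproduced; the candidate genus is simply taken as given.

```python
from typing import Dict, List, Tuple


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: fraction of differing sites between two
    aligned, equal-length sequences (assumed distance measure)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences carry an unambiguous base.
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def nearest_species(query: str,
                    genus_refs: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return the species (within one candidate genus) whose reference
    barcode has the minimum distance to the query, with that distance."""
    candidates = [(species, p_distance(query, ref))
                  for species, refs in genus_refs.items()
                  for ref in refs]
    return min(candidates, key=lambda pair: pair[1])


def membership(distance: float, low: float = 0.01, high: float = 0.10) -> float:
    """Hypothetical piecewise-linear fuzzy membership: 1 at or below `low`,
    0 at or above `high`, linear in between. Thresholds are illustrative."""
    if distance <= low:
        return 1.0
    if distance >= high:
        return 0.0
    return (high - distance) / (high - low)


# Toy reference barcodes for a single genus (invented data).
refs = {
    "sp1": ["ACGTACGTACGTACGTACGT"],
    "sp2": ["ACGTTCGTAAGTACGAACGT"],
}
species, dist = nearest_species("ACGTACGTACGTACGTACGA", refs)
print(species, dist, membership(dist))
```

In this sketch the membership value plays the role of the credibility score: a query whose nearest reference is very close gets a value near 1, while a distant nearest neighbour yields a value near 0 even though a species name is still returned.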
