Department of Biology, McGill University, Montreal, QC, Canada.
SHARCNET, University of Guelph, Guelph, ON, Canada.
Mol Ecol Resour. 2021 Oct;21(7):2190-2203. doi: 10.1111/1755-0998.13407. Epub 2021 May 24.
The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to genus or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging because researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities from COI barcode fragments. The mock communities varied in species number and specimen abundance, while upstream molecular and bioinformatic variables were held constant and a set of COI fragments was used throughout. We evaluated the impact of parameter optimization on the quality of the predictions. Our results indicate that BLAST top hit competes well with more complex approaches if optimized for the mock community under study. In particular, the two machine learning methods that were benchmarked proved more sensitive to reference database heterogeneity and completeness than methods based on sequence similarity. The accuracy of assignments was affected by both species and specimen counts (query compositional heterogeneity), which ultimately influence the selection of appropriate software. We urge researchers to: (i) use realistic mock communities to allow optimization of parameters, regardless of the taxonomic assignment method employed; (ii) carefully choose and curate reference databases, paying attention to their completeness; and (iii) use QIIME, BLAST or LCA methods, in conjunction with parameter tuning, to better assign taxonomy to diverse communities, especially when information on species diversity is lacking for the area under study.
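As a rough illustration of what optimizing a BLAST top-hit approach against a mock community can look like, the sketch below scans candidate percent-identity cutoffs and keeps the one that best recovers a known species list. It is not the study's pipeline: the function names, the F1 criterion, the candidate cutoffs, and the toy hit table are all assumptions made for illustration, and the identity values are invented.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical top-hit record: (query_id, species_of_best_hit, percent_identity).
Hit = Tuple[str, str, float]


def assign_species(hits: Iterable[Hit], min_identity: float) -> Set[str]:
    """Keep the species of every top hit at or above the identity cutoff."""
    return {species for _, species, identity in hits if identity >= min_identity}


def score(predicted: Set[str], expected: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 of a predicted species list against the mock community."""
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def tune_threshold(
    hits: List[Hit], expected: Set[str], thresholds: Iterable[float]
) -> Tuple[float, Dict[str, float]]:
    """Return the cutoff with the highest F1 (first one wins on ties)."""
    best = max(thresholds, key=lambda t: score(assign_species(hits, t), expected)["f1"])
    return best, score(assign_species(hits, best), expected)


# Toy data, invented for illustration only.
EXPECTED = {"Daphnia pulex", "Hyalella azteca", "Chironomus riparius"}
MOCK_HITS: List[Hit] = [
    ("q1", "Daphnia pulex", 99.2),
    ("q2", "Hyalella azteca", 98.1),
    ("q3", "Chironomus riparius", 96.5),
    ("q4", "Daphnia magna", 93.0),       # not in the mock community
    ("q5", "Chironomus tentans", 95.0),  # not in the mock community
]

best_cutoff, metrics = tune_threshold(MOCK_HITS, EXPECTED, [90.0, 95.0, 96.0, 97.0, 99.0])
print(best_cutoff, metrics)
```

With this toy table, a loose cutoff admits species that are not in the mock community, while a strict one drops a true member; the scan makes that trade-off explicit. A cutoff tuned this way is specific to the mock community and reference database used.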