O'Rourke Devon R, Bokulich Nicholas A, Jusino Michelle A, MacManes Matthew D, Foster Jeffrey T
Department of Molecular, Cellular, and Biomedical Sciences University of New Hampshire Durham NH USA.
Pathogen and Microbiome Institute Northern Arizona University Flagstaff AZ USA.
Ecol Evol. 2020 Jul 23;10(18):9721-9739. doi: 10.1002/ece3.6594. eCollection 2020 Sep.
Metabarcoding studies provide a powerful approach to estimate the diversity and abundance of organisms in mixed communities in nature. While strategies exist for optimizing sample and sequence library preparation, best practices for bioinformatic processing of amplicon sequence data are lacking in animal diet studies. Here we evaluate how decisions made in core bioinformatic processes, including sequence filtering, database design, and classification, can influence animal metabarcoding results. We show that denoising methods have lower error rates than traditional clustering methods, although these differences are largely mitigated by removing low-abundance sequence variants. We also found that available reference datasets from GenBank and BOLD for the animal marker gene cytochrome oxidase I (COI) can be complementary, and we discuss methods to improve existing databases to include versioned releases. Taxonomic classification methods can dramatically affect results. For example, in both a mock community and bat guano samples, the commonly used Barcode of Life Database (BOLD) Classification API assigned fewer names from order through species levels than all other classifiers tested (vsearch-SINTAX and q2-feature-classifier's BLAST + LCA, VSEARCH + LCA, and Naive Bayes classifiers). The lack of consensus on bioinformatics best practices limits comparisons among studies and may introduce biases. Our work suggests that biological mock communities offer a useful standard to evaluate the myriad computational decisions impacting animal metabarcoding accuracy. Further, these comparisons highlight the need for continual evaluations as new tools are adopted to ensure that the inferences drawn reflect meaningful biology instead of digital artifacts.
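The abstract notes that removing low-abundance sequence variants largely mitigates the error-rate differences between denoising and clustering. As a minimal sketch of what such a filter might look like, the Python below drops variants whose read count falls under a cutoff. The function name `drop_low_abundance`, the `min_reads` parameter, its default value, and the example counts are all hypothetical illustrations; they are not the authors' pipeline, thresholds, or data.

```python
from typing import Dict


def drop_low_abundance(counts: Dict[str, int], min_reads: int = 2) -> Dict[str, int]:
    """Keep only sequence variants observed at least `min_reads` times.

    `counts` maps a variant identifier to its read count in one sample.
    The cutoff is an assumed, illustrative value, not one taken from the study.
    """
    return {variant: n for variant, n in counts.items() if n >= min_reads}


# Hypothetical read counts per sequence variant (not data from the study).
example_counts = {"variant_a": 1500, "variant_b": 37, "variant_c": 1}
filtered = drop_low_abundance(example_counts, min_reads=2)
```

In practice the appropriate cutoff would depend on sequencing depth and study design, which is one reason a mock community of known composition is useful for calibrating it.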