Bonin Aurélie, Guerrieri Alessia, Ficetola Gentile Francesco
Department of Environmental Science and Policy, University of Milan, Milan, Italy.
Argaly, Bâtiment Clean space, Sainte-Hélène-du-Lac, France.
Mol Ecol Resour. 2023 Feb;23(2):368-381. doi: 10.1111/1755-0998.13709. Epub 2022 Sep 15.
Clustering approaches are pivotal for handling the many sequence variants obtained in DNA metabarcoding data sets, and they have therefore become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to gather sequences into molecular operational taxonomic units (MOTUs), each of which ideally represents a homogeneous taxonomic entity (e.g., a species or a genus). However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging is tested even less often. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limitation of MOTU over-merging, limitation of MOTU over-splitting, and trade-off between over-merging and over-splitting. We extracted sequences from a public database for nine markers, ranging from generalist markers targeting Bacteria or Eukaryota to more specific markers targeting a class or a subclass (e.g., Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera, and on the rates of over-splitting and over-merging across different clustering thresholds, we were able to propose threshold values that minimize the risk of over-splitting, minimize the risk of over-merging, or offer a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold. Instead, we advocate careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
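The threshold scan described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple pairwise proxy in which over-splitting is the fraction of same-species sequence pairs whose similarity falls below the threshold, and over-merging is the fraction of different-species pairs whose similarity reaches it. The function names, the toy similarity values, and the use of the summed rates as the trade-off criterion are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a pairwise proxy for over-splitting and
# over-merging, not the procedure used in the paper.

def over_splitting_rate(within_species_sims, threshold):
    """Fraction of same-species pairs below the threshold (would be split)."""
    split = sum(s < threshold for s in within_species_sims)
    return split / len(within_species_sims)


def over_merging_rate(between_species_sims, threshold):
    """Fraction of different-species pairs at or above the threshold (would be merged)."""
    merged = sum(s >= threshold for s in between_species_sims)
    return merged / len(between_species_sims)


def scan_thresholds(within_species_sims, between_species_sims, thresholds):
    """Return (threshold, over-splitting rate, over-merging rate) for each threshold."""
    return [
        (t,
         over_splitting_rate(within_species_sims, t),
         over_merging_rate(between_species_sims, t))
        for t in thresholds
    ]


def trade_off_threshold(within_species_sims, between_species_sims, thresholds):
    """Pick the threshold minimizing the sum of both rates (assumed criterion)."""
    rows = scan_thresholds(within_species_sims, between_species_sims, thresholds)
    return min(rows, key=lambda row: row[1] + row[2])[0]


# Hypothetical toy similarities for a single marker (made-up values).
within = [0.99, 0.98, 0.97, 0.995, 0.96]    # pairs from the same species
between = [0.90, 0.93, 0.95, 0.97, 0.88]    # pairs from different species
thresholds = [0.85, 0.90, 0.95, 0.96, 0.97, 0.98, 0.99]

for t, split_rate, merge_rate in scan_thresholds(within, between, thresholds):
    print(f"threshold={t:.2f}  over-splitting={split_rate:.2f}  over-merging={merge_rate:.2f}")

print("trade-off threshold:", trade_off_threshold(within, between, thresholds))
```

To favor one risk over the other, as the abstract suggests may be appropriate depending on research objectives, the sum in `trade_off_threshold` could be replaced by a weighted sum; the weights would be a further assumption.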