Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies.

Authors

Bonin Aurélie, Guerrieri Alessia, Ficetola Gentile Francesco

Affiliations

Department of Environmental Science and Policy, University of Milan, Milan, Italy.

Argaly, Bâtiment Clean space, Sainte-Hélène-du-Lac, France.

Publication information

Mol Ecol Resour. 2023 Feb;23(2):368-381. doi: 10.1111/1755-0998.13709. Epub 2022 Sep 15.

Abstract

Clustering approaches are pivotal to handle the many sequence variants obtained in DNA metabarcoding data sets, and therefore they have become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to gather sequences into molecular operational taxonomic units (MOTUs), each of which ideally represents a homogeneous taxonomic entity (e.g., a species or a genus). However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging even less tested. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limitation of MOTU over-merging, limitation of MOTU over-splitting, and trade-off between over-merging and over-splitting. We extracted sequences from a public database for nine markers, ranging from generalist markers targeting Bacteria or Eukaryota, to more specific markers targeting a class or a subclass (e.g., Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera, and on the rates of over-splitting and over-merging across different clustering thresholds, we were able to propose threshold values minimizing the risk of over-splitting, that of over-merging, or offering a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold. Instead, we advocate careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
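As a rough illustration of the trade-off described above, the following Python sketch clusters a handful of toy sequences at two similarity thresholds and reports how often species are split across MOTUs or merged into the same MOTU. Everything in it is an assumption made for illustration and not the authors' pipeline: the function names (`similarity`, `cluster`, `split_merge_rates`), the position-wise identity measure on pre-aligned sequences, the single-linkage clustering, the toy data, and the two rate definitions (share of species spread over more than one MOTU, and share of MOTUs containing more than one species).

```python
from itertools import combinations


def similarity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Single-linkage clustering: link every pair with similarity >= threshold.

    Returns one MOTU identifier per input sequence (hypothetical helper).
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def split_merge_rates(species, motus):
    """Return (over-splitting rate, over-merging rate) under the assumed definitions."""
    species_to_motus = {}
    motu_to_species = {}
    for sp, motu in zip(species, motus):
        species_to_motus.setdefault(sp, set()).add(motu)
        motu_to_species.setdefault(motu, set()).add(sp)
    # A species is over-split when its sequences fall into several MOTUs.
    over_split = sum(len(m) > 1 for m in species_to_motus.values()) / len(species_to_motus)
    # A MOTU is over-merged when it gathers sequences from several species.
    over_merged = sum(len(s) > 1 for s in motu_to_species.values()) / len(motu_to_species)
    return over_split, over_merged


# Toy data (invented): two variants of sp1, one close relative sp2, one distant sp3.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
species = ["sp1", "sp1", "sp2", "sp3"]

for threshold in (0.95, 0.85):
    rates = split_merge_rates(species, cluster(seqs, threshold))
    print(threshold, rates)
```

On this toy input, the stricter threshold splits sp1 into two MOTUs without merging any species, while the looser threshold keeps sp1 together but merges it with sp2, which mirrors the opposing risks the abstract weighs when choosing a threshold.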
