Department of Marine Ecology, Centre for Advanced Studies of Blanes (CEAB-CSIC), Blanes (Girona), Catalonia, Spain.
Department of Evolutionary Biology, Ecology and Environmental Sciences, University of Barcelona and Research Institute of Biodiversity (IRBIO), Barcelona, Catalonia, Spain.
BMC Bioinformatics. 2021 Apr 5;22(1):177. doi: 10.1186/s12859-021-04115-6.
The recent blooming of metabarcoding applications to biodiversity studies comes with some relevant methodological debates. One such issue concerns the treatment of reads by denoising or by clustering methods, which have been wrongly presented as alternatives. It has also been suggested that denoised sequence variants should replace clusters as the basic unit of metabarcoding analyses, missing the fact that sequence clusters are a proxy for species-level entities, the basic unit in biodiversity studies. We argue here that methods developed and tested for ribosomal markers have been uncritically applied to highly variable markers such as cytochrome oxidase I (COI) without conceptual or operational (e.g., parameter setting) adjustment. COI has a naturally high intraspecies variability that should be assessed and reported, as it is a source of highly valuable information. We contend that denoising and clustering are not alternatives. Rather, they are complementary and both should be used together in COI metabarcoding pipelines.
Using a COI dataset from benthic marine communities, we compared two denoising procedures (based on the UNOISE3 and the DADA2 algorithms), set suitable parameters for denoising and clustering, and applied these steps in different orders. Our results indicated that the UNOISE3 algorithm preserved a higher intra-cluster variability. We introduce the program DnoisE to implement the UNOISE3 algorithm taking into account the natural variability (measured as entropy) of each codon position in protein-coding genes. This correction increased the number of sequences retained by 88%. The order of the steps (denoising and clustering) had little influence on the final outcome.
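The entropy-based correction described above can be illustrated with a minimal sketch. It is not the DnoisE implementation. It assumes equal-length, in-frame aligned sequences, computes an unweighted Shannon entropy per codon position, rescales the three entropies so they average 1, and uses them to weight mismatches before applying the UNOISE-style abundance-skew threshold β(d) = 1/2^(αd+1). The function names, the `frame_offset` parameter, and the `alpha` value are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def codon_position_entropy(seqs, frame_offset=0):
    """Mean Shannon entropy (bits) for codon positions 1, 2 and 3.

    Assumes all sequences are aligned, of equal length, and in the same
    reading frame (illustrative simplification).
    """
    totals = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for col in range(len(seqs[0])):
        column = Counter(s[col] for s in seqs)
        n = sum(column.values())
        h = -sum((c / n) * math.log2(c / n) for c in column.values())
        pos = (col - frame_offset) % 3
        totals[pos] += h
        counts[pos] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def weighted_distance(a, b, entropy, frame_offset=0):
    """Mismatch count where each mismatch is weighted by codon-position entropy.

    Weights are rescaled to average 1, so a uniform entropy profile
    reduces to the plain Hamming distance.
    """
    total = sum(entropy)
    weights = [3 * h / total for h in entropy] if total > 0 else [1.0] * 3
    return sum(
        weights[(i - frame_offset) % 3]
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    )


def is_probable_error(abund_candidate, abund_parent, distance, alpha=2.0):
    """UNOISE-style test: flag the candidate as noise if its abundance
    ratio to the parent does not exceed beta(d) = 1 / 2**(alpha*d + 1)."""
    beta = 1.0 / 2 ** (alpha * distance + 1)
    return abund_candidate / abund_parent <= beta


# Toy alignment: only the third codon position of the second codon varies.
seqs = ["ATGAAA", "ATGAAG", "ATGAAC", "ATGAAT"]
entropy = codon_position_entropy(seqs)
d = weighted_distance("ATGAAA", "ATGAAG", entropy)
print(entropy, d, is_probable_error(1, 100, d))
```

In this sketch a mismatch at a highly variable codon position counts for more than one difference, which lowers β(d) and makes the low-abundance sequence less likely to be merged into its more abundant neighbour. That is the direction of effect the abstract reports, with more sequences retained after the correction.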
We highlight the need to combine denoising and clustering, with an adequate choice of stringency parameters, in COI metabarcoding. We present a program that uses the coding properties of this marker to improve the denoising step. We recommend that researchers report their results in terms of both denoised sequences (a proxy for haplotypes) and the clusters formed (a proxy for species), and that they avoid collapsing the sequences of each cluster into a single representative. This will allow studies at the cluster level (ideally equating to species-level diversity) and at the intra-cluster level, and will ease additivity and comparability between studies.
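One way to follow the reporting recommendation above is to keep every denoised sequence as its own record tagged with its cluster, and to derive cluster-level totals from those records rather than replacing them. The sketch below assumes a hypothetical input of `(sequence_id, cluster_id, reads)` tuples; the identifiers and field names are illustrative and do not correspond to the output format of any particular pipeline.

```python
from collections import defaultdict


def summarize_by_cluster(records):
    """Group denoised sequences by cluster without collapsing them.

    records: iterable of (sequence_id, cluster_id, reads) tuples
    (hypothetical layout). Returns, per cluster, the total reads, the
    number of retained variants, and the variants sorted by abundance.
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id, reads in records:
        clusters[cluster_id].append((seq_id, reads))

    summary = {}
    for cluster_id, members in clusters.items():
        summary[cluster_id] = {
            "total_reads": sum(reads for _, reads in members),
            "n_variants": len(members),
            "variants": sorted(members, key=lambda m: -m[1]),
        }
    return summary


# Made-up example records, for illustration only.
records = [
    ("seq1", "clusterA", 120),
    ("seq2", "clusterA", 30),
    ("seq3", "clusterB", 50),
]
for cluster_id, info in summarize_by_cluster(records).items():
    print(cluster_id, info["total_reads"], info["n_variants"])
```

Because the per-sequence records survive, the same table supports both cluster-level (species proxy) and intra-cluster (haplotype proxy) analyses.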