Molik David C, Pfrender Michael E, Emrich Scott J
Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA.
Electrical Engineering and Computer Science, University of Tennessee-Knoxville, Knoxville, TN 37996, USA.
Methods Protoc. 2020 Mar 12;3(1):22. doi: 10.3390/mps3010022.
The advent of next-generation sequencing has allowed for higher-throughput determination of which species live within a specific location. Here we establish that three analysis methods for estimating diversity within samples (Operational Taxonomic Units; the newer Amplicon Sequence Variants; and minhash, a method commonly found in sequence analysis) are affected by various properties of these sequence data. Using simulations, we show that the presence of Single Nucleotide Polymorphisms and the depth of coverage from each species affect the correlations between these approaches. Through this analysis, we provide insights that can inform decisions on the application of each method. Specifically, the presence of sequence read errors and variability in sequence read coverage differentially affect these processing methods.
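For readers unfamiliar with minhash, the following is a minimal, generic sketch of the idea: each sample is reduced to a set of k-mers, and the fraction of matching per-seed minimum hash values estimates the Jaccard similarity between two samples. It is not the implementation used in the study. The function names, the k-mer length `k=8`, the number of hash functions (128), and the use of BLAKE2b as the hash are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes=128):
    """For each of num_hashes seeded hash functions, keep the minimum hash value."""
    return [min(hash_kmer(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison against the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)
```

Under these assumptions, a single nucleotide substitution in a read changes up to `k` of its k-mers, so comparing `estimate_jaccard` for a sequence and a one-base variant of it illustrates how Single Nucleotide Polymorphisms and read errors can lower a minhash-based similarity estimate.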