Chiu Chun-Huo, Chao Anne
Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan.
PeerJ. 2016 Feb 1;4:e1634. doi: 10.7717/peerj.1634. eCollection 2016.
Estimating and comparing microbial diversity are statistically challenging because of limited sampling and possible sequencing errors in low-frequency counts, which produce spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to sequencing errors generally require parametric assumptions about the sampling model or about the functional form of the frequency counts, and different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods, which are universally valid for all parametric assumptions and can be used to compare diversity across communities. We develop here a nonparametric estimator of the true singleton count to replace the spurious singleton count in all methods/approaches. Our estimator of the true singleton count is expressed in terms of the frequency counts of doubletons, tripletons and quadrupletons, provided these three frequency counts are reliable. To quantify microbial alpha diversity for an individual community, we adopt the measure of Hill numbers (effective number of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measure's emphasis on rare or common species, include taxa richness (q = 0), Shannon diversity (q = 1, the exponential of Shannon entropy), and Simpson diversity (q = 2, the inverse of the Simpson index). A diversity profile, which depicts the Hill number as a function of order q, conveys all information contained in a taxa abundance distribution. Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches (non-asymptotic and asymptotic) are developed to compare microbial diversity for multiple communities. (1) A non-asymptotic approach refers to the comparison of estimated diversities of standardized samples with a common finite sample size or sample completeness. This approach aims to compare diversity estimates for equally large or equally complete samples; it is based on the seamless rarefaction and extrapolation sampling curves of Hill numbers, specifically for q = 0, 1 and 2. (2) An asymptotic approach refers to the comparison of the estimated asymptotic diversity profiles. That is, this approach compares the estimated profiles for complete samples or samples whose size tends to be sufficiently large. It is based on statistical estimation of the true Hill number of any order q ≥ 0. In both approaches, replacing the spurious singleton count with our estimated count largely removes the positive bias in diversity estimates caused by spurious singletons and allows fair comparisons across microbial communities, as illustrated by our simulation results and by applying our method to sequencing data from viral metagenomes.
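As an illustration of the quantities named in the abstract, the following minimal sketch computes the frequency counts (the number of taxa observed exactly k times, so k = 1 gives singletons and k = 2 doubletons) and the empirical Hill number of order q from a vector of observed taxon abundances. It uses the standard plug-in definition of Hill numbers only; it is not the authors' singleton estimator or their rarefaction/extrapolation and asymptotic estimators, and the function names `frequency_counts` and `hill_number` are hypothetical, chosen for this example.

```python
import math
from collections import Counter


def frequency_counts(abundances):
    """Return a Counter mapping k to the number of taxa observed exactly k times.

    k = 1 gives the singleton count, k = 2 the doubleton count, and so on.
    Taxa with zero observations are ignored.
    """
    return Counter(a for a in abundances if a > 0)


def hill_number(abundances, q):
    """Empirical (plug-in) Hill number of order q for observed abundances.

    q = 0: taxa richness
    q = 1: exponential of Shannon entropy (the limit as q -> 1)
    q = 2: inverse of the Simpson index
    This is the textbook definition applied to sample proportions,
    not a bias-corrected estimator.
    """
    counts = [a for a in abundances if a > 0]
    n = sum(counts)
    p = [a / n for a in counts]
    if abs(q - 1.0) < 1e-12:
        # Shannon diversity: exp of the entropy of the relative abundances.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    # Hypothetical abundance vector, for illustration only.
    sample = [1, 1, 2, 3, 4, 4, 7]
    f = frequency_counts(sample)
    print("singletons:", f[1], "doubletons:", f[2])
    for q in (0, 1, 2):
        print("q =", q, "Hill number:", hill_number(sample, q))
```

For a perfectly even community the Hill number equals the number of taxa at every order q, which is why it is read as an "effective number of taxa"; uneven communities give values that decrease as q increases.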