Lovell David R, Chua Xin-Yi, McGrath Annette
Queensland University of Technology, Australia.
Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia.
NAR Genom Bioinform. 2020 Jun 19;2(2):lqaa040. doi: 10.1093/nargab/lqaa040. eCollection 2020 Jun.
Thanks to sequencing technology, modern molecular bioscience datasets are often compositions of counts, e.g. counts of amplicons, mRNAs, etc. While there is growing appreciation that compositional data need special analysis and interpretation, less well understood is the discrete nature of these count compositions (or, as we call them, lattice compositions) and the impact this has on statistical analysis, particularly log-ratio analysis (LRA) of pairwise association. While LRA methods are scale-invariant, count compositional data are not; consequently, the conclusions we draw from LRA of lattice compositions depend on the scale of counts involved. We know that additive variation affects the relative abundance of small counts more than large counts; here we show that additive (quantization) variation comes from the discrete nature of count data itself, as well as (biological) variation in the system under study and (technical) variation from measurement and analysis processes. Variation due to quantization is inevitable, but its impact on conclusions depends on the underlying scale and distribution of counts. We illustrate the different distributions of real molecular bioscience data from different experimental settings to show why it is vital to understand the distributional characteristics of count data before applying and drawing conclusions from compositional data analysis methods.
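The contrast between scale-invariant log-ratios and scale-dependent count data can be illustrated with a minimal sketch. It is not the authors' analysis: the relative abundances, the sequencing depths, and the helper names (`log_ratio`, `quantize`, `quantization_error`, `unit_step_effect`) are all hypothetical, and simple rounding stands in for whatever process actually produces integer counts.

```python
import math


def log_ratio(x, y):
    """Natural log of the ratio of two strictly positive abundances."""
    return math.log(x / y)


def quantize(proportions, total):
    """Round proportions * total to integers, giving a lattice composition.

    Rounding is an assumed stand-in for the discreteness of real count data.
    """
    return [round(p * total) for p in proportions]


def quantization_error(proportions, total, i=0, j=1):
    """Absolute gap between the log-ratio of the integer counts and the
    log-ratio of the underlying continuous proportions."""
    counts = quantize(proportions, total)
    observed = log_ratio(counts[i], counts[j])
    underlying = log_ratio(proportions[i], proportions[j])
    return abs(observed - underlying)


def unit_step_effect(count):
    """Change in log abundance caused by adding a single count."""
    return math.log((count + 1) / count)


# Hypothetical underlying relative abundances (illustration only).
proportions = [0.617, 0.271, 0.112]

# Continuous compositions: rescaling both parts leaves the log-ratio unchanged.
scale_invariant = math.isclose(
    log_ratio(proportions[0], proportions[1]),
    log_ratio(1000 * proportions[0], 1000 * proportions[1]),
)

# Lattice compositions: the same proportions observed at different total
# counts (assumed depths) give log-ratios that deviate by different amounts.
errors = {
    total: quantization_error(proportions, total)
    for total in (10, 100, 10_000)
}

# Additive variation of one count shifts a small count far more than a large one.
small_shift = unit_step_effect(5)
large_shift = unit_step_effect(5000)
```

In this sketch the deviation from the underlying log-ratio shrinks as the total count grows, and a one-count change moves the log abundance of a count of 5 far more than that of a count of 5000. The example deliberately avoids zero counts, for which the log-ratio is undefined.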