Biometris, Wageningen University & Research, Wageningen, The Netherlands.
Biointeractions and Plant Health, Wageningen University & Research, Wageningen, The Netherlands.
Mol Ecol Resour. 2021 Aug;21(6):1866-1874. doi: 10.1111/1755-0998.13391. Epub 2021 May 3.
Microbiome composition data collected through amplicon sequencing are count data on taxa in which the total count per sample (the library size) is an artefact of the sequencing platform, and as a result, such data are compositional. To avoid library size dependency, one common way of analysing multivariate compositional data is to perform a principal component analysis (PCA) on data transformed with the centred log-ratio, hereafter called a log-ratio PCA. Two aspects typical of amplicon sequencing data are the large differences in library size and the large number of zeroes. In this study, we show on real data and by simulation that, applied to data that combine these two aspects, log-ratio PCA is nevertheless heavily dependent on the library size. This leads to a reduction in power when testing against any explanatory variable in log-ratio redundancy analysis. If there is additionally a correlation between the library size and the explanatory variable, then the type 1 error becomes inflated. We explore putative solutions to this problem.
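The library size dependency described above can be illustrated with a minimal sketch of the centred log-ratio (clr) transform that underlies log-ratio PCA. The abstract does not say how zeroes are handled before taking logarithms, so the pseudocount of 1 used here is an assumption, as are the function name `clr` and the toy count vectors. The sketch only shows that the clr is invariant to library size when there are no zeroes, and that this invariance is lost once zeroes are replaced by a pseudocount; it is not the authors' analysis.

```python
import math

def clr(counts, pseudocount=0.0):
    """Centred log-ratio: log of each part minus the mean log over all parts.

    The pseudocount is an assumed zero-handling choice, not taken from the paper.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Case 1: no zeroes. Multiplying every count by 100 (a larger library)
# leaves the clr values unchanged, up to floating-point error.
small = [10, 20, 70]
large = [c * 100 for c in small]
diff_no_zeros = max(abs(a - b) for a, b in zip(clr(small), clr(large)))

# Case 2: the same underlying proportions (0.001, 0.004, 0.05, 0.945)
# sequenced at two depths. At depth 100 the rare taxa drop to zero.
shallow = [0, 0, 5, 95]        # library size 100
deep = [10, 40, 500, 9450]     # library size 10000
diff_with_zeros = max(
    abs(a - b)
    for a, b in zip(clr(shallow, pseudocount=1), clr(deep, pseudocount=1))
)

print("max clr difference without zeroes:", diff_no_zeros)
print("max clr difference with zeroes and pseudocount:", diff_with_zeros)
```

In the second case the two samples differ only in sequencing depth, yet their clr coordinates differ substantially, so a PCA on such coordinates could separate samples by library size rather than by composition.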