Greenacre Michael, Martínez-Álvaro Marina, Blasco Agustín
Department of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain.
Department of Agriculture, Horticulture and Engineering Sciences, Scotland's Rural College, Edinburgh, United Kingdom.
Front Microbiol. 2021 Oct 11;12:727398. doi: 10.3389/fmicb.2021.727398. eCollection 2021.
Microbiome and omics datasets are, by their intrinsic biological nature, of high dimensionality, characterized by counts of large numbers of components (microbial genes, operational taxonomic units, RNA transcripts, etc.). These data are generally regarded as compositional since the total number of counts identified within a sample is irrelevant. The central concept in compositional data analysis is the logratio transformation, the simplest being the additive logratios with respect to a fixed reference component. A full set of additive logratios is not isometric, that is, it does not reproduce the geometry of all pairwise logratios exactly, but its lack of isometry can be measured by the Procrustes correlation. The reference component can be chosen to maximize the Procrustes correlation between the additive logratio geometry and the exact logratio geometry, and for high-dimensional data there are many potential references. As a secondary criterion, minimizing the variance of the reference component's log-transformed relative abundance values makes the subsequent interpretation of the logratios even easier. On each of three high-dimensional omics datasets the additive logratio transformation was performed, using references that were identified according to the above-mentioned criteria. For each dataset the compositional data structure was successfully reproduced, that is, the additive logratios were very close to being isometric. The Procrustes correlations achieved for these datasets were 0.9991, 0.9974, and 0.9902, respectively. We thus demonstrate, for high-dimensional compositional data, that additive logratios can provide a valid choice as transformed variables, which (a) are subcompositionally coherent, (b) explain 100% of the total logratio variance, and (c) come measurably very close to being isometric.
The interpretation of additive logratios is much simpler than the complex isometric alternatives and, when the variance of the log-transformed reference is very low, it is even simpler since each additive logratio can be identified with a corresponding compositional component.
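The reference-selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`clr`, `alr`, `procrustes_correlation`, `best_reference`, `reference_log_variance`) and the simulated data are assumptions made for illustration. The sketch relies on the fact that the unweighted geometry of all pairwise logratios is equivalent, up to scale, to that of the centered logratios (CLR), and it computes the Procrustes correlation as the sum of singular values of the cross-product of the two centered, unit-norm configurations. All values are assumed strictly positive, so any zeros would have to be handled beforehand.

```python
import numpy as np

def clr(counts):
    """Centered logratios: stand-in for the exact (all pairwise) logratio geometry."""
    logx = np.log(counts)
    return logx - logx.mean(axis=1, keepdims=True)

def alr(counts, ref):
    """Additive logratios with respect to the component in column `ref`."""
    logx = np.log(counts)
    return np.delete(logx - logx[:, [ref]], ref, axis=1)

def procrustes_correlation(a, b):
    """Procrustes correlation between two sample configurations (rows = samples)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)   # scale each configuration to unit Frobenius norm
    b = b / np.linalg.norm(b)
    # Sum of singular values of the cross-product matrix; equals 1 for a perfect match.
    return np.linalg.svd(a.T @ b, compute_uv=False).sum()

def best_reference(counts):
    """Try every component as reference; return the best index and all correlations."""
    target = clr(counts)
    cors = np.array([procrustes_correlation(target, alr(counts, j))
                     for j in range(counts.shape[1])])
    return int(cors.argmax()), cors

def reference_log_variance(counts):
    """Secondary criterion: variance of each component's log relative abundance."""
    rel = counts / counts.sum(axis=1, keepdims=True)
    return np.log(rel).var(axis=0)

# Simulated strictly positive data (hypothetical): 30 samples, 8 components.
rng = np.random.default_rng(0)
counts = rng.lognormal(size=(30, 8))

ref, cors = best_reference(counts)
print("best reference:", ref, "Procrustes correlation:", round(float(cors[ref]), 4))
print("log-variance of that reference:", round(float(reference_log_variance(counts)[ref]), 4))
```

In practice the two criteria would be weighed together: among the references with Procrustes correlation close to the maximum, one with low log-variance would be preferred, since its additive logratios then behave almost like the log-transformed components themselves.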