Bioinformatics Core Research Group, Deakin University, Geelong, Australia.
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.
Bioinformatics. 2018 Aug 15;34(16):2870-2878. doi: 10.1093/bioinformatics/bty175.
Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models.
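The dependence on library size described above can be illustrated with a minimal sketch. The toy counts, the function names `closure` and `clr`, and the choice of the centered log-ratio as the transformation are all assumptions made for illustration; they are not taken from the review itself. The sketch only shows that two samples with identical relative abundances but different library sizes become indistinguishable once they are expressed relative to their own components.

```python
import math


def closure(counts):
    """Rescale a vector of counts so that its parts sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]


def clr(counts):
    """Centered log-ratio: log of each part minus the mean of the logs.

    Subtracting the mean log is the same as dividing each part by the
    sample's geometric mean before taking logs. Zeros are rejected here;
    how to handle them is a separate modelling decision.
    """
    if any(c <= 0 for c in counts):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical samples: same relative abundances, tenfold different library size.
sample_a = [10, 20, 70]
sample_b = [100, 200, 700]

# The raw counts differ, so a Euclidean distance on them is driven by depth alone.
raw_distance = math.dist(sample_a, sample_b)

# After closure or clr, the arbitrary total no longer matters.
props_a, props_b = closure(sample_a), closure(sample_b)
clr_a, clr_b = clr(sample_a), clr(sample_b)

print("raw distance:", raw_distance)
print("proportions:", props_a, props_b)
print("clr:", clr_a, clr_b)
```

In this sketch the raw distance between the two samples is large even though nothing about their composition differs, whereas the closed and log-ratio-transformed vectors coincide up to floating-point error.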
The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.
Supplementary data are available at Bioinformatics online.