Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.
Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore.
Biol Direct. 2018 Feb 12;13(1):2. doi: 10.1186/s13062-018-0204-y.
Although earlier work on modelling transcript abundance, from vertebrates to lower eukaryotes, has specifically singled out Zipf's law, the observed distributions often deviate from a single power-law slope. In hindsight, power laws of critical phenomena are derived asymptotically under the condition of infinite observations, whereas real-world observations are finite: finite-size effects set in and force a power-law distribution into an exponential decay, which manifests as curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power law on sequencing count data has never truly been examined and quantified.
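The curvature described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the finite-size effect takes the common form of an exponential cutoff, p(x) ∝ x^(-alpha) · exp(-x / cutoff); the abstract does not state the functional form used in the paper, and the names `truncated_power_law`, `local_loglog_slope`, `alpha` and `cutoff` are hypothetical. Under that assumption the local log-log slope is -alpha - x/cutoff, so the apparent exponent drifts with x instead of staying constant.

```python
import numpy as np


def truncated_power_law(x, alpha, cutoff):
    """Unnormalized density x**(-alpha) * exp(-x / cutoff) (assumed form)."""
    return x ** (-alpha) * np.exp(-x / cutoff)


def local_loglog_slope(x, y):
    """Numerical slope d(log y) / d(log x) at each sample point."""
    return np.gradient(np.log(y), np.log(x))


alpha = 1.0    # assumed Zipf-like exponent, for illustration only
cutoff = 1e4   # assumed finite-size cutoff scale, for illustration only
x = np.logspace(0, 5, 200)

pure = x ** (-alpha)                              # asymptotic power law
truncated = truncated_power_law(x, alpha, cutoff)  # with finite-size cutoff

slope_pure = local_loglog_slope(x, pure)
slope_trunc = local_loglog_slope(x, truncated)

# The pure power law keeps one exponent; the truncated one bends downward.
print("pure power law, slope range:", slope_pure.min(), slope_pure.max())
print("truncated, slope at small x:", slope_trunc[0])
print("truncated, slope at large x:", slope_trunc[-1])
```

Plotting `pure` and `truncated` on log-log axes would show a straight line for the former and a downward bend for the latter once x approaches the assumed cutoff.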
The anecdotal description of transcript abundance as almost Zipf's law-like distributed can be conceptualized as the imperfect mathematical rendition of the Pareto power-law distribution when subjected to finite-size effects in the real world. This holds regardless of advances in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern NGS (next-generation sequencing) datasets: an in-house dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spike-in miRNA dataset. Firstly, the finite-size effects cause sequencing count data to deviate from Zipf's law and give rise to reproducibility issues in sequencing experiments. Secondly, they manifest as heteroskedasticity among experimental replicates, which brings about statistical woes. Surprisingly, a straightforward power-law correction that restores the distorted distribution to a single exponent value can dramatically reduce data heteroskedasticity, producing an instant increase in signal-to-noise ratio of 50% and in statistical/detection sensitivity of as much as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves the concordance of significant calls among different normalization methods of a data series by 22% on average. When the sequencing depth differs fourfold, the improvement in concordance is asymmetrical (32% for the higher-depth instance versus 13% for the lower-depth instance), which demonstrates that the simple power-law correction can increase the detection of significant calls at higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and alludes to the metastatic potential of the NUGC3 cell line relative to AGS in our dilution analysis.
The finite-size effects due to undersampling generally plague transcript count data with reproducibility issues, but they can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of the study and the rigor of the scientific findings.
This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.