Finite-size effects in transcript sequencing count distribution: its power-law correction necessarily precedes downstream normalization and comparative analysis.

Affiliations

Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore.

Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore.

Publication information

Biol Direct. 2018 Feb 12;13(1):2. doi: 10.1186/s13062-018-0204-y.

DOI:10.1186/s13062-018-0204-y

PMID:29433547

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5809866/

Abstract

BACKGROUND

Though earlier work on modelling transcript abundance, from vertebrates to lower eukaryotes, has specifically singled out Zipf's law, the observed distributions often deviate from a single power-law slope. In hindsight, while the power laws of critical phenomena are derived asymptotically under the condition of infinite observations, real-world observations are finite, so finite-size effects set in and force a power-law distribution into an exponential decay, which manifests as curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power law on sequencing count data has never truly been examined and quantified.
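The curvature described above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the finite-size distortion takes the common form of a power law with an exponential cutoff, p(x) ∝ x^(-alpha) · exp(-x / cutoff); this functional form, the names `log_density` and `local_slope`, and the values of `ALPHA` and `CUTOFF` are assumptions and do not come from the article.

```python
import math

def log_density(x, alpha, cutoff):
    """Unnormalised log-density of x**(-alpha) * exp(-x / cutoff)."""
    return -alpha * math.log(x) - x / cutoff

def local_slope(x, alpha, cutoff, rel_step=1e-4):
    """Slope of log p against log x at x, by central difference in log space."""
    lo, hi = x * (1 - rel_step), x * (1 + rel_step)
    rise = log_density(hi, alpha, cutoff) - log_density(lo, alpha, cutoff)
    run = math.log(hi) - math.log(lo)
    return rise / run

ALPHA = 2.0    # hypothetical asymptotic exponent
CUTOFF = 1e4   # hypothetical scale at which finite-size effects set in

# Evaluate the apparent exponent at abundances spanning three decades.
slopes = {x: local_slope(x, ALPHA, CUTOFF) for x in (1e1, 1e2, 1e3, 1e4)}
for x, s in slopes.items():
    print(f"x = {x:>8.0f}  local log-log slope = {s:.3f}")
```

Under this assumed form the slope is analytically -alpha - x / cutoff, so the apparent exponent stays close to -2 far below the cutoff and steepens to about -3 at the cutoff itself, which is the "varying exponent" a single straight-line fit would miss.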

RESULTS

The anecdotal description of transcript abundance as almost Zipf's law-like distributed can be conceptualized as the imperfect mathematical rendition of the Pareto power-law distribution when subjected to finite-size effects in the real world. This holds regardless of advances in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern NGS (next-generation sequencing) datasets: an in-house dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spike-in miRNA dataset. Firstly, finite-size effects cause the deviations of sequencing count data from Zipf's law and issues of reproducibility in sequencing experiments. Secondly, they manifest as heteroskedasticity among experimental replicates, which causes statistical problems. Surprisingly, a straightforward power-law correction that restores the distorted distribution to a single exponent value can dramatically reduce data heteroskedasticity, giving an immediate increase in signal-to-noise ratio of 50% and in statistical/detection sensitivity of as much as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves concordance in significant calls among different normalization methods of a data series by 22% on average. When presented with a higher sequencing depth (a fourfold difference), the improvement in concordance is asymmetrical (32% for the higher-depth instance versus 13% for the lower-depth instance), which demonstrates that the simple power-law correction can increase significant detection at higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and alludes to the metastatic potential of the NUGC3 cell line relative to AGS in our dilution analysis.
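The abstract does not describe how the exponent is measured or how the correction is carried out, so the following is only an illustrative diagnostic and not the authors' method. It estimates a power-law exponent above several thresholds with the standard continuous maximum-likelihood estimator, alpha_hat = 1 + n / Σ ln(x_i / x_min). The synthetic data, the helper names `sample_pareto` and `mle_exponent`, and all parameter values are assumptions made for the example.

```python
import math
import random

def sample_pareto(n, alpha, x_min, rng):
    """Draw n values from a continuous power law p(x) ~ x**(-alpha), x >= x_min."""
    # Inverse-transform sampling; 1 - rng.random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def mle_exponent(values, x_min):
    """Continuous maximum-likelihood exponent for the tail x >= x_min."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / x_min) for v in tail)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
counts = sample_pareto(20000, alpha=2.0, x_min=1.0, rng=rng)  # synthetic, not real data

# Re-estimate the exponent on progressively more restricted tails.
estimates = {t: mle_exponent(counts, t) for t in (1.0, 2.0, 4.0)}
for threshold, alpha_hat in estimates.items():
    print(f"x_min = {threshold:.0f}  estimated exponent = {alpha_hat:.3f}")
```

For an undistorted power-law sample such as this synthetic one, the estimate stays close to the true exponent at every threshold. A systematic drift of the estimate with the threshold would be one way to see that a dataset does not follow a single exponent.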

CONCLUSIONS

The finite-size effects due to undersampling generally plague transcript count data with reproducibility issues but can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of the study and the rigor of the scientific findings.

REVIEWERS

This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1996/5809866/83ebfef3bdf7/13062_2018_204_Fig1_HTML.jpg
