Yoon Sora, Nam Dougu
School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.
Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.
BMC Genomics. 2017 May 25;18(1):408. doi: 10.1186/s12864-017-3809-0.
In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, highly expressed genes (or longer genes) are known to be more likely to be called differentially expressed, a phenomenon called read count bias (or gene length bias). This bias has a substantial effect on downstream Gene Ontology over-representation analysis. However, it has not been systematically analyzed for different replicate types of RNA-seq data.
Using mathematical inference and tests on a number of simulated and real RNA-seq datasets, we show that the dispersion coefficient of a gene in the negative binomial model of read counts is the critical determinant of the read count bias (and gene length bias). We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and that many biological replicate datasets from unrelated samples do not suffer from this bias, except for genes with small counts. We also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not.
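The role of dispersion can be illustrated with the standard negative binomial mean-variance relation, Var = mu + phi * mu^2, where phi is the dispersion coefficient. The sketch below is only an illustration under that assumption, not the authors' derivation: the function names, the fold change of 1.5, and the three replicates per group are hypothetical choices. It computes an approximate signal-to-noise ratio for a fixed fold change at different mean counts.

```python
import math


def nb_variance(mu, phi):
    """Negative binomial variance under the mean-dispersion parameterization."""
    return mu + phi * mu ** 2


def signal_to_noise(mu, phi, fold_change=1.5, n_replicates=3):
    """Approximate z-like ratio for a two-group mean difference.

    Illustrative assumption: the group means are mu and mu * fold_change,
    each group has n_replicates samples, and the standard error of the
    difference in sample means comes from the NB variances.
    """
    mu_b = mu * fold_change
    se = math.sqrt((nb_variance(mu, phi) + nb_variance(mu_b, phi)) / n_replicates)
    return (mu_b - mu) / se


if __name__ == "__main__":
    for phi in (0.001, 0.5):  # small vs. large dispersion (hypothetical values)
        for mu in (10, 100, 1000, 10000):
            print(f"phi={phi:<6} mu={mu:<6} SNR={signal_to_noise(mu, phi):.2f}")
```

With a small phi the Poisson term mu dominates the variance, so the ratio grows with the mean count and highly expressed genes are more easily called differentially expressed. With a large phi the term phi * mu^2 dominates and the ratio is nearly independent of the mean count.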
We showed for the first time that small gene variance (or, similarly, small dispersion) is the main cause of read count bias (and gene length bias), and we analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.