Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data.

Author information

Yoon Sora, Nam Dougu

Affiliations

School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.

Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.

Publication information

BMC Genomics. 2017 May 25;18(1):408. doi: 10.1186/s12864-017-3809-0.

Abstract

BACKGROUND

In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, highly expressed genes (or longer genes) are known to be more likely to be called differentially expressed, a phenomenon known as read count bias (or gene length bias). This bias has a great effect on downstream Gene Ontology over-representation analysis. However, it has not been systematically analyzed for different replicate types of RNA-seq data.

RESULTS

We show, through mathematical inference and tests on a number of simulated and real RNA-seq datasets, that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias). We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and that many biological replicate datasets from unrelated samples do not suffer from such a bias, except for genes with small counts. We also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not.
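The role of dispersion can be illustrated with a small numerical sketch. It is not the authors' derivation: it only assumes the standard negative binomial parameterization, variance = mean + dispersion × mean², and a delta-method approximation for the variance of the log group mean. The function names, the fold change of 1.5, and the three replicates per group are hypothetical choices made for illustration.

```python
import math


def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def approx_z(mean, dispersion, fold_change=1.5, n_per_group=3):
    """Rough signal-to-noise of a log fold change between two groups.

    Assumption (illustrative only): the variance of the log of a group
    mean is approximated by var / (n * mean^2) via the delta method.
    """
    mean_b = mean * fold_change
    var_log_a = nb_variance(mean, dispersion) / (n_per_group * mean ** 2)
    var_log_b = nb_variance(mean_b, dispersion) / (n_per_group * mean_b ** 2)
    return math.log(fold_change) / math.sqrt(var_log_a + var_log_b)


if __name__ == "__main__":
    # Small dispersion mimics technical replicates; large dispersion
    # mimics biological replicates from unrelated samples.
    for dispersion in (0.001, 0.5):
        scores = [approx_z(m, dispersion) for m in (10, 100, 1000, 10000)]
        print(dispersion, [round(s, 2) for s in scores])
```

Under these assumptions, the same fold change yields a much larger score at high read counts when the dispersion is small, whereas with a large dispersion the score is nearly flat across read counts, because the dispersion term dominates the 1/mean term.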

CONCLUSION

We show for the first time that small gene variance (similarly, dispersion) is the main cause of the read count bias (and gene length bias), and we analyze the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.
