Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data.

Author information

Yoon Sora, Nam Dougu

Affiliations

School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.

Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.

Publication information

BMC Genomics. 2017 May 25;18(1):408. doi: 10.1186/s12864-017-3809-0.

DOI:10.1186/s12864-017-3809-0

PMID:28545404

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5445461/

Abstract

BACKGROUND

In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed, which is called read count bias (or gene length bias). This bias had a great effect on the downstream Gene Ontology over-representation analysis. However, the bias has not been systematically analyzed for different replicate types of RNA-seq data.

RESULTS

Using mathematical inference and tests on a number of simulated and real RNA-seq datasets, we show that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias). We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and that many biological replicate datasets from unrelated samples do not suffer from such a bias, except for genes with small counts. We also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not.
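The link between dispersion and read count bias can be illustrated with the standard negative binomial parameterization, in which a count with mean μ and dispersion φ has variance μ + φμ², so the squared coefficient of variation is 1/μ + φ. The sketch below is not the authors' derivation. It uses a delta-method approximation for the z-score of a log fold change, and the dispersion values (0.001 and 0.3), the fold change, and the number of replicates are illustrative assumptions, as are all function names.

```python
import math

def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2

def squared_cv(mean, dispersion):
    """Squared coefficient of variation; algebraically 1/mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2

def approx_z(mean, dispersion, fold_change=1.5, n_per_group=3):
    """Delta-method approximation of the z-score for a log fold change
    between two groups of n_per_group replicates with similar mean counts.
    fold_change and n_per_group are illustrative defaults, not values
    taken from the paper."""
    se = math.sqrt(2.0 * squared_cv(mean, dispersion) / n_per_group)
    return abs(math.log(fold_change)) / se

# Assumed dispersions: 0.001 stands in for technical-replicate-like data,
# 0.3 for replicates from unrelated samples.
for dispersion in (0.001, 0.3):
    for mean in (10, 100, 1000, 10000):
        z = approx_z(mean, dispersion)
        print(f"dispersion={dispersion} mean={mean} z={z:.2f}")
```

Under these assumptions, the approximate z-score for the same fold change grows steeply with the mean count when the dispersion is small, because the 1/μ term dominates. When the dispersion is large, the φ term dominates and the z-score is nearly flat across counts, which is consistent with the abstract's statement that the bias is mostly confined to data with small dispersions.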

CONCLUSION

We showed for the first time that small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias), and we analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f5d/5445461/409517c81ce2/12864_2017_3809_Fig1_HTML.jpg
