Evaluation of logistic regression models and effect of covariates for case-control study in RNA-Seq analysis.

Author information

Choi Seung Hoan, Labadorf Adam T, Myers Richard H, Lunetta Kathryn L, Dupuis Josée, DeStefano Anita L

Affiliations

Department of Biostatistics, Boston University, 801 Massachusetts Avenue, Boston, Massachusetts, USA.

Department of Neurology, Boston University, 72 East Concord Street, Boston, Massachusetts, USA.

Publication information

BMC Bioinformatics. 2017 Feb 6;18(1):91. doi: 10.1186/s12859-017-1498-y.

DOI:10.1186/s12859-017-1498-y

PMID:28166718

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5294900/

Abstract

BACKGROUND

Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates.

RESULTS

When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates, whereas the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth's logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally bring the Type-I error rates of all methods close to the nominal alpha levels of 0.05 and 0.01. Applying the data adaptive method, however, controls Type-I error rates. The NB, BL, and FL regressions gain power with large sample size, large log2 fold-change, and low dispersion. The FL regression has power comparable to that of NB regression.

CONCLUSIONS

We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth's logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.
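To make the modeling idea concrete, the following is a minimal sketch of Firth's penalized logistic regression, in which case-control status is regressed on the expression of a single gene. It is not the authors' code. The function names `firth_logistic` and `_penalized_loglik`, the log2(count + 1) transformation, and the read counts are all assumptions made for illustration. The fit uses Newton-type steps on the Firth-modified score, with step-halving on the penalized log-likelihood for stability.

```python
import numpy as np


def _penalized_loglik(X, y, beta):
    """Log-likelihood plus Firth's penalty, 0.5 * log det(X'WX)."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    # w = p * (1 - p), computed on the log scale to avoid underflow
    w = np.exp(-np.logaddexp(0.0, eta) - np.logaddexp(0.0, -eta))
    _, logdet = np.linalg.slogdet(X.T @ (X * w[:, None]))
    return loglik + 0.5 * logdet


def firth_logistic(X, y, max_iter=200, tol=1e-8):
    """Fit a Firth-penalized logistic regression.

    X: (n, p) design matrix that already contains an intercept column.
    y: (n,) array of 0/1 labels (1 = case, 0 = control).
    Returns the coefficient vector of length p.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        p = np.exp(-np.logaddexp(0.0, -eta))  # stable sigmoid
        w = np.exp(-np.logaddexp(0.0, eta) - np.logaddexp(0.0, -eta))
        info_inv = np.linalg.inv(X.T @ (X * w[:, None]))  # (X'WX)^-1
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^-1 X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth-modified score: ordinary score plus h * (0.5 - p)
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        # Halve the step until the penalized log-likelihood does not decrease
        current = _penalized_loglik(X, y, beta)
        for _ in range(30):
            if _penalized_loglik(X, y, beta + step) >= current:
                break
            step = step / 2.0
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Hypothetical read counts for one gene in 4 controls and 4 cases.
# Cases and controls are perfectly separated, so ordinary maximum
# likelihood has no finite slope estimate; the Firth penalty keeps it finite.
counts = np.array([12, 30, 25, 18, 140, 95, 210, 160])
status = np.array([0, 0, 0, 0, 1, 1, 1, 1])
expr = np.log2(counts + 1.0)  # assumed transformation, for illustration only
X = np.column_stack([np.ones_like(expr), expr])
beta = firth_logistic(X, status)
print("intercept, slope:", beta)
```

In an analysis of this kind the fit would be repeated gene by gene, with any covariates added as extra columns of `X`. Inference on the slope (for example, a penalized likelihood ratio test) is not shown here.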
