Evaluation of logistic regression models and effect of covariates for case-control study in RNA-Seq analysis.

Author information

Choi Seung Hoan, Labadorf Adam T, Myers Richard H, Lunetta Kathryn L, Dupuis Josée, DeStefano Anita L

Affiliations

Department of Biostatistics, Boston University, 801 Massachusetts Avenue, Boston, Massachusetts, USA.

Department of Neurology, Boston University, 72 East Concord Street, Boston, Massachusetts, USA.

Publication information

BMC Bioinformatics. 2017 Feb 6;18(1):91. doi: 10.1186/s12859-017-1498-y.

Abstract

BACKGROUND

Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates.
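
The two modeling directions contrasted here can be written side by side. This is a sketch in notation that is an assumption of this note, not taken from the paper: for sample i, Y_i is the read count of one gene, D_i is disease status (1 = case, 0 = control), z_i is a vector of covariates, and phi is the NB dispersion in the common mean-dispersion parameterization. Normalization, offsets, and any transformation of the counts are omitted because the abstract does not specify them.

```latex
% NB regression: the read count is the outcome, disease status is the predictor
Y_i \sim \mathrm{NB}(\mu_i, \phi), \qquad
\operatorname{Var}(Y_i) = \mu_i + \phi\,\mu_i^{2}, \qquad
\log \mu_i = \beta_0 + \beta_1 D_i + \boldsymbol{\gamma}^{\top}\mathbf{z}_i

% Logistic regression: disease status is the outcome, the read count is the predictor
\operatorname{logit}\,\Pr(D_i = 1 \mid Y_i, \mathbf{z}_i)
  = \alpha_0 + \alpha_1 Y_i + \boldsymbol{\delta}^{\top}\mathbf{z}_i
```

Under this notation, the association test for a gene is a test of beta_1 = 0 in the NB model and of alpha_1 = 0 in the logistic model. The logistic form has no dispersion parameter to estimate.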

RESULTS

When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth's logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make Type-I error rates of all methods close to nominal alpha levels of 0.05 and 0.01. However, Type-I error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 fold-change, and low dispersion. The FL regression has comparable power to NB regression.
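
To make the terms "inflated", "conservative", and "close to nominal" concrete, the following is a minimal sketch of how an empirical Type-I error rate can be computed from p-values obtained on data simulated under the null hypothesis. The function names and the normal-approximation tolerance band are assumptions made for illustration; the abstract does not state which criterion the authors used.

```python
import math


def empirical_type1_error(null_pvalues, alpha):
    """Fraction of null-simulation p-values at or below alpha."""
    rejections = sum(1 for p in null_pvalues if p <= alpha)
    return rejections / len(null_pvalues)


def classify(rate, alpha, n_sim, z=1.96):
    """Label a rate relative to alpha.

    Assumption: a rate within z binomial standard errors of alpha
    counts as "near nominal". This band is illustrative only.
    """
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_sim)
    if rate > alpha + half_width:
        return "inflated"
    if rate < alpha - half_width:
        return "conservative"
    return "near nominal"


# Perfectly calibrated p-values on an evenly spaced grid in (0, 1).
calibrated = [(i - 0.5) / 1000 for i in range(1, 1001)]
rate = empirical_type1_error(calibrated, alpha=0.05)
label = classify(rate, alpha=0.05, n_sim=len(calibrated))
```

A well-calibrated test rejects about a fraction alpha of null replicates; a rate well above alpha corresponds to the inflation reported for NB regression in small or highly dispersed samples, and a rate well below alpha corresponds to the conservative behavior reported for the Classical and Bayes logistic regressions.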

CONCLUSIONS

We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth's logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.
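
For readers unfamiliar with Firth's method: it maximizes the usual logistic log-likelihood plus half the log-determinant of the Fisher information, which keeps estimates finite even when cases and controls are perfectly separated. The following is a minimal single-predictor sketch in plain Python, not the authors' implementation; the function name `firth_logistic`, the step cap, and the toy data are assumptions for illustration, and the predictor is assumed to take at least two distinct values.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def firth_logistic(x, y, max_iter=200, tol=1e-10, max_step=5.0):
    """Fit logit P(y=1) = b0 + b1*x by Firth's penalized likelihood.

    x: list of floats (one predictor); y: list of 0/1 outcomes.
    Returns the tuple (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [_sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]

        # Fisher information I = X^T W X for the design [1, x] (2 x 2).
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = s0 * s2 - s1 * s1
        inv00, inv01, inv11 = s2 / det, -s1 / det, s0 / det

        # Hat-matrix diagonal: h_i = w_i * [1, x_i] I^{-1} [1, x_i]^T.
        h = [wi * (inv00 + 2.0 * inv01 * xi + inv11 * xi * xi)
             for wi, xi in zip(w, x)]

        # Firth-modified score residuals: y_i - p_i + h_i * (0.5 - p_i).
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))

        # Fisher scoring step, capped in size to avoid runaway updates.
        d0 = inv00 * u0 + inv01 * u1
        d1 = inv01 * u0 + inv11 * u1
        biggest = max(abs(d0), abs(d1))
        if biggest > max_step:
            d0 *= max_step / biggest
            d1 *= max_step / biggest
        b0 += d0
        b1 += d1
        if biggest < tol:
            break
    return b0, b1


# Toy data with complete separation: every x=0 is a control, every x=1 a case.
x = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
y = [0, 0, 0, 1, 1, 1]
b0, b1 = firth_logistic(x, y)
```

On this toy data ordinary maximum likelihood has no finite slope estimate, whereas the Firth estimate is finite: for a binary predictor it coincides with adding 0.5 to each cell of the 2x2 table, giving b1 = 2·ln 7, roughly 3.89.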

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cca2/5294900/68ac800be6fd/12859_2017_1498_Fig1_HTML.jpg
