Zhang Yiwen, Zhou Hua, Zhou Jin, Sun Wei
Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203.
Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772.
J Comput Graph Stat. 2017;26(1):1-13. doi: 10.1080/10618600.2016.1154063. Epub 2017 Feb 16.
Data with multivariate count responses frequently occur in modern applications. The commonly used multinomial-logit model is limited by its restrictive mean-variance structure. For instance, analyzing count data from RNA-seq technology with the multinomial-logit model leads to serious errors in hypothesis testing. The ubiquity of over-dispersion and complicated correlation structures among multivariate counts calls for more flexible regression models. In this article, we study some generalized linear models that incorporate various correlation structures among the counts. Current literature lacks a treatment of these models, partly because they do not belong to the natural exponential family. We study estimation, testing, and variable selection for these models in a unifying framework. The regression models are compared on both synthetic and real RNA-seq data.
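The restrictive mean-variance structure mentioned above can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the abstract does not name a specific alternative model, so the Dirichlet mixing used here to generate over-dispersed counts, along with the function names `multinomial_variance` and `simulate_overdispersed_counts` and all parameter values, is hypothetical and chosen for demonstration. It compares the variance a multinomial model implies, m * p_j * (1 - p_j), with the empirical variance of counts whose category probabilities vary from sample to sample.

```python
import numpy as np


def multinomial_variance(m, p):
    """Per-category variance implied by a multinomial model: m * p_j * (1 - p_j)."""
    p = np.asarray(p, dtype=float)
    return m * p * (1.0 - p)


def simulate_overdispersed_counts(n, m, alpha, seed=0):
    """Draw n count vectors of total size m with sample-specific probabilities.

    Assumption for illustration only: probabilities are drawn from a
    Dirichlet(alpha) distribution, which is one common way to induce
    over-dispersion relative to a plain multinomial.
    """
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(alpha, size=n)  # one probability vector per sample
    return np.array([rng.multinomial(m, p) for p in probs])


m = 100                            # total count per sample (hypothetical)
alpha = np.array([2.0, 3.0, 5.0])  # Dirichlet parameters (hypothetical)
p_mean = alpha / alpha.sum()       # average category probabilities

counts = simulate_overdispersed_counts(n=5000, m=m, alpha=alpha)
empirical_var = counts.var(axis=0)
model_var = multinomial_variance(m, p_mean)

# Ratios well above 1 indicate over-dispersion that a multinomial cannot capture.
dispersion_ratio = empirical_var / model_var
print("multinomial variance:", model_var)
print("empirical variance:  ", empirical_var)
print("ratio:               ", dispersion_ratio)
```

Under this particular mixing scheme the theoretical variance inflation is (m + a0) / (1 + a0) with a0 the sum of `alpha`, which is 10 for the values above, so the printed ratios should be far from 1. A multinomial-logit fit to such data would understate the variability of the counts, which is consistent with the testing errors the abstract describes.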