IEEE/ACM Trans Comput Biol Bioinform. 2019 Mar-Apr;16(2):442-454. doi: 10.1109/TCBB.2018.2790918. Epub 2018 Jan 8.
RNA sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Because RNA-seq measurements are relative in nature, between-sample normalization is an essential step in differential expression (DE) analysis. The normalization step of existing DE detection algorithms is usually ad hoc and performed only once, prior to DE detection. This may be suboptimal, since normalization should ideally be based on non-DE genes only and should therefore be coupled with DE detection. We propose a unified statistical model for joint normalization and DE detection of RNA-seq data. Sample-specific normalization factors are modeled as unknown parameters in the gene-wise linear models and are estimated jointly with the regression coefficients. By imposing a sparsity-inducing L1 penalty (or a mixed L1/L2 penalty for multiple treatment conditions) on the regression coefficients, we formulate the problem as a penalized least-squares regression problem and apply the augmented Lagrangian method to solve it. Simulation and real data studies show that the proposed model and algorithms perform better than or comparably to existing methods in terms of detection power and false-positive rate. The performance gain increases with larger sample size or higher signal-to-noise ratio, and is more pronounced when a large proportion of genes are differentially expressed in an asymmetric manner.
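The idea of estimating normalization factors and DE effects together can be illustrated with a small sketch. It is not the authors' implementation, and it rests on several assumptions that are not in the abstract: a two-condition design with a 0/1 treatment indicator `x`, a matrix `Y` of log-scale expression values, the gene-wise model `Y[i, j] ≈ alpha[i] + d[j] + beta[i] * x[j]`, and an arbitrary penalty weight `lam`. Instead of the augmented Lagrangian method named above, it uses plain block coordinate descent with soft-thresholding, which is simpler to read. All function and variable names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def joint_normalize_and_detect(Y, x, lam, n_iter=200, tol=1e-10):
    """Illustrative sketch (assumed model, not the published algorithm).

    Minimizes 0.5 * sum_ij (Y[i,j] - alpha[i] - d[j] - beta[i]*x[j])**2
              + lam * sum_i |beta[i]|
    by alternating between the gene-wise parameters (alpha, beta)
    and the sample-specific normalization factors d.
    """
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    n_genes, n_samples = Y.shape
    xc = x - x.mean()                      # centered treatment indicator
    d = np.zeros(n_samples)
    beta = np.zeros(n_genes)
    for _ in range(n_iter):
        # Gene step: one-predictor lasso with an unpenalized intercept,
        # solved in closed form for every gene at once.
        R = Y - d                          # remove current normalization factors
        beta_new = soft_threshold(R @ xc, lam) / (xc @ xc)
        alpha = R.mean(axis=1) - beta_new * x.mean()
        # Sample step: least-squares update of the normalization factors.
        d_new = (Y - alpha[:, None] - np.outer(beta_new, x)).mean(axis=0)
        d_new -= d_new.mean()              # fix the additive ambiguity with alpha
        change = max(np.abs(beta_new - beta).max(), np.abs(d_new - d).max())
        beta, d = beta_new, d_new
        if change < tol:
            break
    alpha = (Y - d).mean(axis=1) - beta * x.mean()
    return alpha, beta, d

# Synthetic example: 200 genes, 5 control + 5 treated samples,
# 20 genes up-regulated (an asymmetric DE pattern).
rng = np.random.default_rng(0)
n_genes, n_samples, n_de = 200, 10, 20
x = np.array([0.0] * 5 + [1.0] * 5)
alpha_true = rng.normal(5.0, 1.0, n_genes)
d_true = rng.normal(0.0, 0.5, n_samples)
d_true -= d_true.mean()
beta_true = np.zeros(n_genes)
beta_true[:n_de] = 2.0
Y = (alpha_true[:, None] + d_true[None, :] + np.outer(beta_true, x)
     + rng.normal(0.0, 0.1, (n_genes, n_samples)))

alpha_hat, beta_hat, d_hat = joint_normalize_and_detect(Y, x, lam=1.0)
detected = np.flatnonzero(beta_hat)        # genes with a nonzero effect
print("genes flagged as DE:", detected.size)
print("max error in normalization factors:", np.abs(d_hat - d_true).max())
```

In this sketch the L1 penalty is what separates a shift shared by all genes in the treated samples (absorbed into `d`) from gene-specific effects (left in `beta`). Without the penalty the two would not be identifiable. The value of `lam` here is chosen by hand for the synthetic data; a real analysis would need a principled way to select it.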