Cui Zihan, Liu Yuhang, Zhang Jinfeng, Qiu Xing
Department of Statistics, Florida State University, Tallahassee, FL, 32304, USA.
Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, 14624, USA.
Bioinformatics. 2021 Sep 9;37(17):2627-2636. doi: 10.1093/bioinformatics/btab155.
We developed super-delta2, a differential gene expression analysis pipeline designed for multi-group comparisons of RNA-seq data. It includes a customized one-way ANOVA F-test and a post-hoc test for pairwise group comparisons; both are designed to work with a multivariate normalization procedure that reduces technical noise. It also includes a trimming procedure with bias correction to obtain the robust and approximately unbiased summary statistics used in these tests. Using large-sample theory based on the Negative Binomial Poisson (NBP) distribution, we demonstrated the asymptotic applicability of super-delta2 to log-transformed read counts in RNA-seq data.
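To make the starting point concrete, the sketch below computes a textbook one-way ANOVA F statistic for a single gene on log-transformed counts. It is not the customized F-test of super-delta2: it omits the multivariate normalization, trimming and bias correction described above. The function name `one_way_anova_f`, the example counts and the `log2(count + 1)` transform are all assumptions made for illustration.

```python
import math


def one_way_anova_f(groups):
    """Classical one-way ANOVA for a list of groups (each a list of numbers).

    Returns (F statistic, between-group df, within-group df).
    """
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical read counts for one gene in three groups (illustrative only)
counts = [[10, 12, 9, 11], [20, 22, 19, 25], [11, 9, 13, 10]]

# Assumed transform: log2(count + 1), which avoids log(0) for zero counts
log_counts = [[math.log2(c + 1) for c in group] for group in counts]

f_stat, df1, df2 = one_way_anova_f(log_counts)
print(f"F = {f_stat:.3f} on ({df1}, {df2}) degrees of freedom")
```

In a real analysis this statistic would be computed for every gene and compared against an F reference distribution; the pipeline described here additionally normalizes and trims before forming its summary statistics.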
We compared super-delta2 with three commonly used RNA-seq data analysis methods (limma/voom, edgeR and DESeq2) using both simulated and real datasets. In all three simulation settings, super-delta2 not only achieved the best overall statistical power but was also the only method that controlled type I error at the nominal level. When applied to a breast cancer dataset to identify differential expression patterns associated with multiple pathologic stages, super-delta2 selected more enriched pathways directly linked to the underlying biological condition (breast cancer) than the other methods.
In conclusion, by incorporating trimming and bias correction in the normalization step, super-delta2 was able to achieve tight control of type I error. Because the hypothesis tests are based on an asymptotic normal approximation of the NBP distribution, super-delta2 does not require the computationally expensive iterative optimization procedures used by methods such as edgeR and DESeq2, which occasionally have convergence issues.
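As a rough illustration of why a normal approximation on the log scale is plausible, consider a first-order delta-method calculation under the standard negative binomial parameterization. The symbols used here (count $Y$, mean $\mu$, dispersion $\phi$) are assumptions introduced for this sketch and may differ from the NBP model used by the authors.

```latex
\operatorname{Var}(Y) = \mu + \phi\,\mu^{2}
\quad\Longrightarrow\quad
\operatorname{Var}(\log Y) \;\approx\; \frac{\operatorname{Var}(Y)}{\mu^{2}}
\;=\; \frac{1}{\mu} + \phi .
```

Under this assumption, the variance of the log-transformed count is close to the constant $\phi$ when $\mu$ is large, which is one reason closed-form tests on log counts can avoid iterative fitting.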
Our method is implemented in an R package, 'superdelta2', freely available at: https://github.com/fhlsjs/superdelta2.
Supplementary data are available at Bioinformatics online.