Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA.
Envirome Institute, University of Louisville, Louisville, KY, USA.
BMC Genomics. 2020 Jan 28;21(1):75. doi: 10.1186/s12864-020-6502-7.
High-throughput RNA sequencing (RNA-seq) has become an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consensus on which normalization and statistical methods are most appropriate for analyzing these data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially in studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQ-pgQ2, with three of the most frequently used alternatives, RLE (relative log expression), TMM (trimmed mean of M-values) and UQ (upper quartile normalization), in the analysis of RNA-seq data. We evaluated the performance of these methods for gene-level differential expression analysis by considering three factors: 1) normalization combined with the choice of a Wald test from DESeq2 or an exact test/QL (quasi-likelihood) F-test from edgeR; 2) sample sizes in two balanced two-group comparisons; and 3) sequencing read depths.
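To make the idea of normalization concrete, the following is a minimal sketch of plain upper quartile (UQ) scaling, assuming a small in-memory count matrix. It is not the UQ-pgQ2 method and not the edgeR or DESeq2 implementation; the function names and the toy matrix are hypothetical, and centring the factors on their arithmetic mean is one of several possible conventions.

```python
from statistics import mean, quantiles


def upper_quartile_factors(counts):
    """Return one scaling factor per sample from a genes x samples count matrix.

    Assumption: genes with zero counts in every sample are ignored, and
    factors are centred on their arithmetic mean (a choice made for this sketch).
    """
    n_samples = len(counts[0])
    # Keep only genes that are expressed in at least one sample.
    expressed = [row for row in counts if any(c > 0 for c in row)]
    upper_quartiles = []
    for j in range(n_samples):
        column = [row[j] for row in expressed]
        # quantiles(..., n=4) returns three cut points; the last is the 75th percentile.
        upper_quartiles.append(quantiles(column, n=4, method="inclusive")[2])
    centre = mean(upper_quartiles)
    # Note: a sample whose upper quartile is 0 would need special handling.
    return [q / centre for q in upper_quartiles]


def normalize(counts, factors):
    """Divide each sample (column) by its scaling factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [0, 0],
    [100, 200],
    [50, 100],
    [30, 60],
    [5, 10],
]
factors = upper_quartile_factors(toy_counts)
normalized = normalize(toy_counts, factors)
```

In this toy case the two columns become identical after scaling, which is the behaviour expected when samples differ only in sequencing depth.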
Using the MAQC RNA-seq datasets with small numbers of sample replicates, we found that UQ-pgQ2 normalization combined with an exact test can achieve better performance in terms of power and specificity in differential gene expression analysis. However, using an intra-group analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large, and that a QL F-test performs best at sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly at a given sample size.
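The logic of an intra-group false-positive analysis can be sketched as follows: replicates from a single condition are split at random into two pseudo-groups, so any gene called differentially expressed is a false positive by construction. This is only an illustration under stated assumptions. The exact permutation test on mean differences is a simple stand-in, not the Wald, exact or QL F-test compared in the study, and the function names and simulated data are hypothetical.

```python
import random
from itertools import combinations


def permutation_pvalue(values, group_a):
    """Exact two-sided permutation p-value for a difference in group means.

    Stand-in test for this sketch only; it is not one of the tests in the study.
    """
    n, k = len(values), len(group_a)
    total = sum(values)

    def abs_mean_diff(indices):
        sum_a = sum(values[i] for i in indices)
        return abs(sum_a / k - (total - sum_a) / (n - k))

    observed = abs_mean_diff(group_a)
    stats = [abs_mean_diff(idx) for idx in combinations(range(n), k)]
    # Small tolerance guards against floating-point ties.
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


def intra_group_false_positives(counts, alpha=0.05, seed=0):
    """Count genes called significant when one condition is split in two.

    counts: genes x replicates matrix, all replicates from the SAME condition.
    """
    n = len(counts[0])
    rng = random.Random(seed)
    group_a = tuple(sorted(rng.sample(range(n), n // 2)))
    pvalues = [permutation_pvalue(row, group_a) for row in counts]
    return sum(p < alpha for p in pvalues)


# Hypothetical simulated data: 200 genes, 10 replicates, no true differences.
sim_rng = random.Random(42)
sim_counts = [[sim_rng.randint(50, 150) for _ in range(10)] for _ in range(200)]
false_positives = intra_group_false_positives(sim_counts)
false_positive_rate = false_positives / len(sim_counts)
```

A method with good type I error control should give a false positive rate close to or below the chosen `alpha` in this setting; repeating the random split with different seeds gives a more stable estimate.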
We found that the UQ-pgQ2 method combined with an exact test/QL F-test is the best choice for controlling false positives when the sample size is small. When the sample size is large, UQ-pgQ2 with a QL F-test is a better choice for type I error control in an intra-group analysis. Based on the simulated data, we observed that read depth has minimal impact on differential gene expression analysis.