Li Xiaohong, Brock Guy N, Rouchka Eric C, Cooper Nigel G F, Wu Dongfeng, O'Toole Timothy E, Gill Ryan S, Eteleeb Abdallah M, O'Brien Liz, Rai Shesh N
Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, United States of America.
Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, United States of America.
PLoS One. 2017 May 1;12(5):e0176185. doi: 10.1371/journal.pone.0176185. eCollection 2017.
Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, choosing an optimal method remains a challenge because multiple factors contribute to read count variability, which affects the overall sensitivity and specificity. To determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines across different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, a detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the most commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they do so at the cost of a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, the results from an analysis based on the qualitative characteristics of sample distribution for MAQC2 and human breast cancer datasets show that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3, which has less variation across its five replicates, all methods performed similarly.
Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene expression analysis of RNA-seq data that are skewed towards low read counts and have high variation: they improve specificity while maintaining good detection power and controlling the FDR at the nominal level.
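The abstract describes Med-pgQ2 and UQ-pgQ2 only as per-gene normalization applied after per-sample median or upper-quartile global scaling. The following is a minimal sketch of that two-step idea, not the authors' implementation. It assumes a genes-by-samples count matrix, that each sample's scaling factor is the median or upper quartile of its nonzero counts, and that the per-gene step divides each gene by its own median (Q2) across samples. The function names, the handling of all-zero genes, and the absence of any rescaling constant are all hypothetical choices made for illustration.

```python
import numpy as np


def per_sample_scale(counts, method="median"):
    """Divide each sample (column) by its median or upper quartile.

    Assumption: the quantile is taken over nonzero counts only, and every
    sample has at least one nonzero count.
    """
    counts = np.asarray(counts, dtype=float)
    q = 50 if method == "median" else 75
    factors = np.array([np.percentile(col[col > 0], q) for col in counts.T])
    return counts / factors  # broadcasts one factor per column


def per_gene_q2(scaled):
    """Divide each gene (row) by its median across samples.

    Assumption: genes whose median is zero are left unchanged to avoid
    division by zero.
    """
    scaled = np.asarray(scaled, dtype=float)
    gene_median = np.median(scaled, axis=1, keepdims=True)
    gene_median[gene_median == 0] = 1.0
    return scaled / gene_median


def pgq2_sketch(counts, method="median"):
    """Per-sample global scaling followed by per-gene median scaling."""
    return per_gene_q2(per_sample_scale(counts, method=method))


# Toy matrix: 4 genes (rows) x 2 samples (columns); sample 2 has twice the depth.
toy = np.array([[10, 20], [20, 40], [30, 60], [0, 0]])
med_result = pgq2_sketch(toy, method="median")
uq_result = pgq2_sketch(toy, method="uq")
```

In this toy matrix the second sample is exactly the first at double depth, so the per-sample step makes the two columns identical and the per-gene step then brings every expressed gene to the same level. This illustrates what the abstract means by comparing conditions at similar count levels.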