Kadota Koji, Nishiyama Tomoaki, Shimizu Kentaro
Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan.
Algorithms Mol Biol. 2012 Apr 5;7(1):5. doi: 10.1186/1748-7188-7-5.
High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq), enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for accurate subsequent analysis of differential gene expression. A more robust normalization method is therefore desirable for identifying true differences in tag count data.
We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq), each run with its default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC), a measure of both sensitivity and specificity. We found that packages using our strategy in the data normalization step performed well overall. This result was also observed for a real experimental dataset.
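The key concept, removing potential DEGs before calculating the normalization factor, can be illustrated with a minimal Python sketch. This is not the implementation evaluated in the paper. It assumes plain library-size scaling in place of the packages' own normalization methods and a simple fold-change ranking in place of a real DEG-identification method. The function names, the `drop_fraction` and `pseudo` parameters, and the toy counts are all hypothetical.

```python
from math import log2


def size_factors(counts, keep=None):
    """Library-size scaling factors computed from the genes listed in `keep`.

    `counts` is a list of gene rows, each holding one count per sample.
    When `keep` is None, every gene contributes to the factors.
    """
    rows = counts if keep is None else [counts[i] for i in keep]
    totals = [sum(column) for column in zip(*rows)]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def deg_elimination_factors(counts, group_a, group_b,
                            drop_fraction=0.25, pseudo=0.5):
    """Recompute scaling factors after dropping potential DEGs.

    Assumption: potential DEGs are the genes with the largest absolute
    log2 fold change between the two groups. This is a stand-in for a
    real DEG-identification method.
    """
    # Step 1: provisional factors from all genes.
    provisional = size_factors(counts)

    # Step 2: score each gene on the provisionally normalized counts.
    scores = []
    for row in counts:
        norm = [c / f for c, f in zip(row, provisional)]
        mean_a = sum(norm[j] for j in group_a) / len(group_a)
        mean_b = sum(norm[j] for j in group_b) / len(group_b)
        scores.append(abs(log2((mean_a + pseudo) / (mean_b + pseudo))))

    # Step 3: drop the top-scoring fraction as potential DEGs.
    n_drop = int(len(counts) * drop_fraction)
    ranked = sorted(range(len(counts)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[n_drop:])

    # Step 4: final factors from the remaining genes only.
    return size_factors(counts, keep), keep


# Toy data: three unchanged genes and one gene that is 10x higher in sample 2.
toy_counts = [[100, 100], [200, 200], [50, 50], [100, 1000]]
naive = size_factors(toy_counts)
refined, kept = deg_elimination_factors(toy_counts, group_a=[0], group_b=[1])
print(naive)    # [0.5, 1.5]
print(refined)  # [1.0, 1.0]
print(kept)     # [0, 1, 2]
```

In the toy data the single strongly changed gene inflates the second sample's total, so the naive factors make the three unchanged genes look different between samples. Once that gene is excluded, the factors return to 1.0 for both samples.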
Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can be applied widely to other types of tag count data and to microarray data.