Zyprych-Walczak J, Szabelska A, Handschuh L, Górczak K, Klamecka K, Figlerowicz M, Siatkowski I
Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, 60-637 Poznan, Poland.
Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61-704 Poznan, Poland; Department of Hematology and Bone Marrow Transplantation, Poznan University of Medical Sciences, 60-569 Poznan, Poland.
Biomed Res Int. 2015;2015:621690. doi: 10.1155/2015/621690. Epub 2015 Jun 15.
High-throughput sequencing technologies, such as the Illumina HiSeq, are powerful tools for investigating a wide range of biological and medical problems. The massive and complex data sets that sequencers produce create a need for statistical and computational methods that can handle data analysis and management. Data normalization is one of the most crucial steps of data processing, and it must be considered carefully because it has a profound effect on the results of the analysis. In this work, we present a comprehensive comparison of five normalization methods related to sequencing depth that are widely used for transcriptome sequencing (RNA-seq) data, and we assess their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow for selecting the optimal normalization procedure for any particular data set. The workflow includes calculating bias and variance values for the control genes, the sensitivity and specificity of the methods, and classification errors, as well as generating diagnostic plots. Combining this information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
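As a rough illustration of the bias and variance step of the workflow, the following Python sketch compares two depth-related normalizations on a toy count table. It is not the authors' implementation: the abstract does not name the five methods, so total-count and upper-quartile scaling are assumed here as common examples, and the function names, the toy data, and the particular definitions of bias (absolute difference of mean log2 values between two conditions) and variance (variance of log2 values across samples) for control genes are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pvariance, quantiles  # quantiles needs Python 3.8+


def total_count_factors(counts):
    """Per-sample scale factors: library size divided by the mean library size."""
    n_samples = len(next(iter(counts.values())))
    sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    avg = mean(sizes)
    return [s / avg for s in sizes]


def upper_quartile_factors(counts):
    """Per-sample scale factors: upper quartile of nonzero counts, mean-centred."""
    n_samples = len(next(iter(counts.values())))
    uq = []
    for j in range(n_samples):
        nonzero = [row[j] for row in counts.values() if row[j] > 0]
        uq.append(quantiles(nonzero, n=4)[2])  # third cut point = 75th percentile
    avg = mean(uq)
    return [u / avg for u in uq]


def normalize(counts, factors):
    """Divide every count by the scale factor of its sample."""
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}


def control_gene_stats(normalized, control_genes, groups):
    """Return (bias, variance) averaged over the control genes.

    Assumed definitions (illustrative only):
      bias     = |mean log2 value in condition 1 - mean log2 value in condition 2|
      variance = population variance of log2 values across all samples
    Control genes are assumed to have constant true expression, so lower is better.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch expects exactly two conditions")
    biases, variances = [], []
    for gene in control_genes:
        logs = [math.log2(v + 1.0) for v in normalized[gene]]
        first = [x for x, lab in zip(logs, groups) if lab == labels[0]]
        second = [x for x, lab in zip(logs, groups) if lab == labels[1]]
        biases.append(abs(mean(first) - mean(second)))
        variances.append(pvariance(logs))
    return mean(biases), mean(variances)


# Toy data (hypothetical): the second sample was sequenced twice as deeply.
toy_counts = {"g1": [10, 20], "g2": [30, 60], "g3": [60, 120]}
toy_groups = ["A", "B"]
toy_controls = ["g1", "g2"]

for name, factor_fn in [("total count", total_count_factors),
                        ("upper quartile", upper_quartile_factors)]:
    norm = normalize(toy_counts, factor_fn(toy_counts))
    bias, var = control_gene_stats(norm, toy_controls, toy_groups)
    print(f"{name}: bias={bias:.4f}, variance={var:.4f}")
```

In a fuller version of such a workflow, the same statistics would be computed for every candidate method and considered together with sensitivity, specificity, classification error, and diagnostic plots before a method is chosen.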