Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America.
PLoS One. 2010 Mar 26;5(3):e9905. doi: 10.1371/journal.pone.0009905.
Contemporary high-dimensional biological assays, such as mRNA expression microarrays, regularly involve multiple data processing steps, such as experimental processing, computational processing, sample selection, or feature selection (i.e., gene selection), before any biological conclusions are drawn. These steps can dramatically change the interpretation of an experiment, yet their evaluation has received limited attention in the literature. Comparing processing methods is not straightforward, and investigators are often unsure which method is best. We present a simple statistical tool, Standardized WithIn class Sum of Squares (SWISS), that allows investigators to compare alternative data processing methods, such as different experimental methods, normalizations, or technologies, on a dataset in terms of how well they cluster a priori biological classes. SWISS uses Euclidean distance to determine which method does a better job of clustering the data elements according to their a priori classifications. We apply SWISS to three different gene expression applications. The first uses four different datasets to compare different experimental methods, normalizations, and gene sets. The second, using data from the MicroArray Quality Control (MAQC) project, compares different microarray platforms. The third compares different technologies: a single Agilent two-color microarray versus one lane of RNA-Seq. These applications indicate the variety of problems that SWISS can help solve. The SWISS analysis of one-color versus two-color microarrays gives investigators who use two-color arrays the opportunity to review their results in light of a single-channel analysis, with all of the associated benefits offered by this design. Analysis of the MAQC data shows differential intersite reproducibility by array platform. SWISS also shows that one lane of RNA-Seq clusters data by biological phenotype as well as a single Agilent two-color microarray does.
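The abstract does not spell out the formula, so the following is only a minimal sketch under an assumed definition: the score is taken to be the within-class sum of squared Euclidean distances (each sample to its own class centroid) divided by the total sum of squared distances (each sample to the overall centroid), so that a lower score means tighter clustering of the a priori classes. The function name `swiss_score`, the class labels, and the toy data are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def swiss_score(samples, labels):
    """Within-class sum of squares divided by total sum of squares.

    samples: list of equal-length numeric feature vectors, one per sample.
    labels:  list of a priori class labels, one per sample.
    Assumed definition (not taken from the abstract); lower is better.
    """
    if len(samples) != len(labels) or not samples:
        raise ValueError("samples and labels must be non-empty and equal in length")

    def centroid(rows):
        # Feature-wise mean of a list of vectors.
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq_dist(rows, center):
        # Sum of squared Euclidean distances from each row to `center`.
        return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

    total_ss = sum_sq_dist(samples, centroid(samples))
    if total_ss == 0:
        raise ValueError("all samples are identical; the score is undefined")

    # Group samples by their a priori class.
    by_class = defaultdict(list)
    for row, label in zip(samples, labels):
        by_class[label].append(row)

    within_ss = sum(sum_sq_dist(rows, centroid(rows)) for rows in by_class.values())
    return within_ss / total_ss


# Toy data (made up): the same four samples after two hypothetical
# processing methods, with the same a priori labels.
labels = ["tumor", "tumor", "normal", "normal"]
method_a = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
method_b = [[0.0, 0.0], [0.0, 8.0], [4.0, 0.0], [4.0, 8.0]]

score_a = swiss_score(method_a, labels)
score_b = swiss_score(method_b, labels)
print(f"method A: {score_a:.3f}, method B: {score_b:.3f}")
```

In this toy case, method A separates the two classes along the first feature while keeping each class compact, so it receives the lower score. Because the score is standardized by the total sum of squares, it is unaffected by a uniform rescaling of the data, which is what would make such a ratio usable for comparing methods that produce values on different scales.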