Liu Li, Hawkins Douglas M, Ghosh Sujoy, Young S Stanley
National Institute of Statistical Sciences, P.O. Box 14006, Research Triangle Park, NC 27709-4006, USA.
Proc Natl Acad Sci U S A. 2003 Nov 11;100(23):13167-72. doi: 10.1073/pnas.1733249100. Epub 2003 Oct 27.
Microarray data consist of a number of biological samples, each assessed for the expression level of a typically large number of genes. Statistical techniques are needed to help discern possible patterns in these data. Our technique applies a combination of mathematical and statistical methods to progressively take the data set apart so that different aspects can be examined for both general patterns and very specific effects. Unfortunately, these data tables are often corrupted by extreme values (outliers), missing values, and non-normal distributions that preclude standard analysis. We develop a robust analysis method to address these problems. The benefits of this robust analysis are both an understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws. Our method requires a single pass and does not resort to complex "cleaning" or imputation of the data table before analysis. We illustrate the method with a commercial data set.
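The abstract does not name the algorithm, so the following is only a minimal sketch of one way a robust, imputation-free decomposition of a sample-by-gene table could look. It assumes a rank-one model fitted by alternating least-absolute-deviation (L1) regressions, in which missing cells are simply skipped. The function names (`weighted_median`, `l1_slopes`, `robust_rank1`) and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_median(values, weights):
    """Lower weighted median: the minimizer of sum_k weights[k] * |values[k] - m|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def l1_slopes(X, b):
    """For each row i, find a[i] minimizing sum_j |X[i, j] - a[i] * b[j]|.

    Only observed (non-NaN) cells with b[j] != 0 are used, so missing
    values are skipped rather than imputed.
    """
    a = np.zeros(X.shape[0])
    for i, row in enumerate(X):
        ok = ~np.isnan(row) & (b != 0)
        if ok.any():
            a[i] = weighted_median(row[ok] / b[ok], np.abs(b[ok]))
    return a

def robust_rank1(X, n_iter=100, tol=1e-10):
    """Alternate L1 fits of row and column effects so that X ~ outer(a, b)."""
    X = np.asarray(X, dtype=float)
    a = np.zeros(X.shape[0])
    b = np.ones(X.shape[1])          # assumed starting point
    prev = None
    for _ in range(n_iter):
        a = l1_slopes(X, b)          # update sample effects
        b = l1_slopes(X.T, a)        # update gene effects
        scale = np.linalg.norm(b)
        if scale == 0:
            break
        a, b = a * scale, b / scale  # fix the scale ambiguity
        fit = np.outer(a, b)
        if prev is not None and np.max(np.abs(fit - prev)) < tol:
            break
        prev = fit
    return a, b

# Toy table (hypothetical): exact rank one, one gross outlier, one missing cell.
a0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b0 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.outer(a0, b0)
X[2, 1] = 100.0      # injected outlier
X[0, 3] = np.nan     # missing value

a, b = robust_rank1(X)
fit = np.outer(a, b)
resid = X - fit      # large residuals point at unusual sample-by-gene cells
flagged = np.unravel_index(np.nanargmax(np.abs(resid)), resid.shape)
print(flagged)
```

In this sketch the fitted layer plays the role of the general pattern, while the residual table isolates individual sample-by-gene cells, such as the injected outlier. Subtracting the fit and repeating on the residuals would be one way to take the table apart progressively, though whether that matches the authors' procedure is an assumption.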