Cohen Freue Gabriela V, Hollander Zsuzsanna, Shen Enqing, Zamar Ruben H, Balshaw Robert, Scherer Andreas, McManus Bruce, Keown Paul, McMaster W Robert, Ng Raymond T
Department of Computer Science, University of British Columbia, Vancouver, British Columbia, Canada.
Bioinformatics. 2007 Dec 1;23(23):3162-9. doi: 10.1093/bioinformatics/btm487. Epub 2007 Oct 12.
Producing microarray data involves multiple steps, some of which may suffer from technical problems that seriously damage the quality of the data. It is therefore essential to identify low-quality arrays. This article addresses two questions: (1) how to assess the quality of a microarray dataset using the measures provided in quality control (QC) reports; and (2) how to identify possible sources of quality problems.
We propose a novel multivariate approach that evaluates the quality of an array by examining the Mahalanobis distance of its quality attributes from those of the other arrays. We therefore call it Mahalanobis Distance Quality Control (MDQC), and we examine different approaches to this method. MDQC flags problematic arrays by outlier detection: it flags those arrays whose quality attributes jointly depart from those of the bulk of the data. Using two case studies, we show that a multivariate analysis gives substantially richer information than analyzing each parameter of the QC report in isolation. Moreover, once the QC report is produced, our quality assessment method is computationally inexpensive, and the results can be easily visualized and interpreted. Finally, we show that computing these distances on subsets of the quality measures in the report may increase the method's ability to detect unusual arrays and helps to identify possible reasons for the quality problems.
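The idea of flagging arrays by Mahalanobis distance can be illustrated with a minimal sketch. This is not the MDQC library: the function name `mahalanobis_sq`, the toy data, and the use of the classical sample mean and covariance are assumptions made for illustration (a robust estimator of center and scatter could be substituted), and the chi-square cutoff is one common convention rather than a rule taken from the abstract.

```python
import numpy as np


def mahalanobis_sq(qc):
    """Squared Mahalanobis distance of each array (row) from the sample center.

    Assumption: center and scatter are the classical sample mean and
    covariance; this is a simplification chosen for the sketch.
    """
    qc = np.asarray(qc, dtype=float)
    center = qc.mean(axis=0)
    # atleast_2d keeps the single-attribute case working (np.cov returns a scalar there)
    cov = np.atleast_2d(np.cov(qc, rowvar=False))
    diff = qc - center
    solved = np.linalg.solve(cov, diff.T).T  # cov^{-1} (x - center) for every row
    return np.einsum("ij,ij->i", diff, solved)


# Hypothetical QC table: 20 arrays x 2 quality attributes that normally move together.
x = np.linspace(-2.0, 2.0, 19)
y = x + 0.1 * (-1.0) ** np.arange(19)
# The last array has unremarkable values on each attribute, but the combination is unusual.
qc = np.column_stack([np.append(x, 1.5), np.append(y, -1.5)])

# 0.99 quantile of a chi-square with 2 degrees of freedom (closed form for 2 df).
cutoff = -2.0 * np.log(1.0 - 0.99)

d2_joint = mahalanobis_sq(qc)            # both attributes together
d2_first = mahalanobis_sq(qc[:, [0]])    # a subset: the first attribute alone

flagged = np.flatnonzero(d2_joint > cutoff)
print("flagged arrays (joint):", flagged)
print("last array, joint d^2:", d2_joint[-1], "first attribute only:", d2_first[-1])
```

In this toy table the last array looks ordinary on the first attribute alone but stands out once both attributes are considered jointly. Recomputing the distance on chosen column subsets, as in `qc[:, [0]]`, is one way to explore which measures drive a flag.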
A library implementing MDQC will soon be available from Bioconductor.