Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD, USA.
BMC Bioinformatics. 2011 May 7;12:137. doi: 10.1186/1471-2105-12-137.
Microarray technology has become a widely used tool in the biological sciences. Over the past decade, the number of users has grown exponentially, and with the number of applications and secondary data analyses rapidly increasing, we expect this rate to continue. Various initiatives such as the External RNA Control Consortium (ERCC) and the MicroArray Quality Control (MAQC) project have explored ways to provide standards for the technology. For microarrays to become generally accepted as a reliable technology, statistical methods for assessing quality will be an indispensable component; however, there remains a lack of consensus in both defining and measuring microarray quality.
We begin by providing a precise definition of microarray quality and reviewing existing Affymetrix GeneChip quality metrics in light of this definition. We show that the best-performing metrics require multiple arrays to be assessed simultaneously. While such multi-array quality metrics are adequate for bench science, as microarrays begin to be used in clinical settings, single-array quality metrics will be indispensable. To this end, we define a single-array version of one of the best multi-array quality metrics and show that this metric performs as well as the best multi-array metrics. We then use this new quality metric to assess the quality of microarray data available via the Gene Expression Omnibus (GEO) using more than 22,000 Affymetrix HGU133a and HGU133plus2 arrays from 809 studies.
We find that approximately 10 percent of these publicly available arrays are of poor quality. Moreover, the quality of microarray measurements varies greatly from hybridization to hybridization, study to study, and lab to lab, with some experiments producing unusable data. Many of the concepts described here are applicable to other high-throughput technologies.