Vrije Universiteit Brussel, Brussels, Belgium.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Mar-Apr;10(2):383-92. doi: 10.1109/TCBB.2013.12.
The potential of microarray gene expression (MAGE) data is only partially explored because individual studies contain a limited number of samples. This limitation can be overcome by merging or integrating data sets from independent MAGE experiments designed to study the same biological problem. However, this process is hindered by batch effects, which are study-dependent and result in random data distortion; numerical transformations are therefore needed to make the integration of different data sets accurate and meaningful. Our contribution in this paper is two-fold. First, we propose GENESHIFT, a new nonparametric batch effect removal method based on two key elements from statistics: empirical density estimation and the inner product as a distance measure between two probability density functions. Second, we introduce a new validation index for batch effect removal methods, based on the observation that samples from two independent studies drawn from the same population should exhibit similar probability density functions. We evaluated and compared the GENESHIFT method with four other state-of-the-art methods for batch effect removal: batch-mean centering, empirical Bayes or COMBAT, distance-weighted discrimination, and cross-platform normalization. Our validation framework employs several validation indices that provide complementary information about the efficiency of batch effect removal methods. The results show that none of the methods clearly outperforms the others. Moreover, most of the methods used for comparison perform very well with respect to some validation indices while performing very poorly with respect to others. GENESHIFT exhibits robust performance, and its average rank is the highest among the average ranks of all methods used for comparison.
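The abstract names two ingredients for GENESHIFT, empirical density estimation and the inner product between two probability density functions, but does not spell out the algorithm. The following is therefore only an illustrative sketch of how those two ingredients might be combined for a single gene, not the published method. It assumes a Gaussian kernel with a fixed, hand-picked bandwidth, a Riemann-sum approximation of the integral, and a pure location shift chosen by grid search. All function names and parameters (`gaussian_kde_on_grid`, `density_inner_product`, `best_shift`, `bandwidth`, `candidates`) are hypothetical. Batch-mean centering, one of the comparison methods, is included in its usual form of subtracting each gene's mean within a batch.

```python
import numpy as np


def gaussian_kde_on_grid(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def density_inner_product(x, y, shift=0.0, bandwidth=0.3, n_grid=512):
    """Approximate the integral of f_x(t) * f_(y + shift)(t) dt (Riemann sum)."""
    y_shifted = y + shift
    lo = min(x.min(), y_shifted.min()) - 3.0 * bandwidth
    hi = max(x.max(), y_shifted.max()) + 3.0 * bandwidth
    grid = np.linspace(lo, hi, n_grid)
    dx = grid[1] - grid[0]
    fx = gaussian_kde_on_grid(x, grid, bandwidth)
    fy = gaussian_kde_on_grid(y_shifted, grid, bandwidth)
    return float(np.sum(fx * fy) * dx)


def best_shift(x, y, candidates, bandwidth=0.3):
    """Assumed search strategy: pick the candidate shift of `y` whose
    estimated density has the largest inner product with that of `x`."""
    scores = [density_inner_product(x, y, s, bandwidth) for s in candidates]
    return float(candidates[int(np.argmax(scores))])


def batch_mean_center(batch):
    """Batch-mean centering for a genes-by-samples matrix from one batch."""
    return batch - batch.mean(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic single-gene expression values from two hypothetical studies;
    # the second study carries an additive offset standing in for a batch effect.
    study_a = rng.normal(loc=0.0, scale=1.0, size=200)
    study_b = rng.normal(loc=2.0, scale=1.0, size=200)
    candidates = np.linspace(-4.0, 4.0, 81)
    shift = best_shift(study_a, study_b, candidates)
    print("estimated shift for study_b:", shift)
    print("inner product before:", density_inner_product(study_a, study_b))
    print("inner product after: ", density_inner_product(study_a, study_b, shift))
```

A larger inner product indicates more overlap between the two estimated densities, which is the sense in which the abstract uses it as a measure between probability density functions. The same quantity could also serve as a rough stand-in for the idea behind the proposed validation index, namely that samples from the same population should show similar densities after correction.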