de Brevern Alexandre G, Hazout Serge, Malpertuy Alain
Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM E0346, Université Denis DIDEROT-Paris 7, case 7113, 2, place Jussieu, 75251 Paris, France.
BMC Bioinformatics. 2004 Aug 23;5:114. doi: 10.1186/1471-2105-5-114.
Microarray technologies produce large amounts of data. Hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs), which are a major drawback for clustering methods. Usually the MVs are left untreated, replaced by zero, or estimated by the k-Nearest Neighbor (kNN) approach. This paper studies the stability of gene clusters, defined by various hierarchical clustering algorithms, in microarray experiments with and without MVs.
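As an illustration of the kNN idea mentioned above, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a genes-by-conditions matrix stored as nested lists with `None` marking an MV, a Euclidean distance computed over the conditions observed in both genes, and an unweighted mean of the neighbours' values; the function name `knn_impute` and the default `k` are hypothetical choices.

```python
import math


def knn_impute(matrix, k=2):
    """Return a copy of a genes x conditions matrix with None entries filled.

    For a gene missing a value in condition j, the k most similar genes
    that do have a value in condition j are found, and their mean is used.
    Assumption: similarity is the root-mean-square difference over the
    conditions observed in both genes.
    """
    imputed = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                if h == g or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Normalise by the number of shared conditions so that genes
                # with different overlaps remain comparable.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            neighbours = candidates[:k]
            if neighbours:
                imputed[g][j] = sum(v for _, v in neighbours) / len(neighbours)
            # If no usable neighbour exists, the entry is left as None.
    return imputed
```

Distances are taken from the original matrix rather than the partially filled copy, so the result does not depend on the order in which MVs are visited.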
In this study, we show that MVs have important effects on the stability of gene clusters. Moreover, the magnitude of the gene misallocations depends on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, surprisingly even for a very small proportion of MVs (e.g. 1%). In most cases, the MVs must be replaced by expected values. Replacing MVs by the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that the kNN approach is less suitable for extreme values of gene expression.
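One simple way to quantify gene misallocations between two clusterings of the same genes (for example, one obtained from complete data and one from data containing MVs) is the Rand index: the fraction of gene pairs that are either grouped together in both clusterings or separated in both. The sketch below is an assumption made for illustration; the abstract does not state which stability measure the authors used, and the function name `pair_agreement` is hypothetical.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Rand index between two cluster assignments of the same genes.

    labels_a[i] and labels_b[i] are the cluster labels of gene i in the
    two clusterings. Label values themselves need not match; only the
    grouping of genes matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same genes")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        total += 1
    # With fewer than two genes there are no pairs to disagree on.
    return agree / total if total else 1.0
```

A value of 1.0 means the two clusterings group the genes identically; lower values indicate that more gene pairs changed from grouped to separated or the reverse.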
The presence of MVs (even at a low rate) is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used. Some methods should be used carefully. Nevertheless, the kNN approach is an efficient method for restoring missing gene expression values, with a low error level. Our study highlights the need for statistical treatment of microarray data to avoid misinterpretation.