Kim Seo Young, Lee Jae Won, Bae Jong Sung
Research Institute for Basic Science, Chonnam National University, Gwangju, 500-757, Korea.
BMC Bioinformatics. 2006 Mar 14;7:134. doi: 10.1186/1471-2105-7-134.
Microarray technology has made it possible to measure the expression levels of large numbers of genes simultaneously and in a short time. Gene expression data are information-rich; however, extensive data mining is required to identify the patterns that characterize the underlying mechanisms of action. In microarray data analysis, clustering is an important tool for finding groups of genes with similar expression patterns. However, hard clustering methods, which assign each gene to exactly one cluster, are poorly suited to microarray datasets because the gene clusters in such datasets frequently overlap.
In this study we applied the fuzzy partitional clustering method known as Fuzzy C-Means (FCM) to overcome the limitations of hard clustering. To assess the effect of data normalization, we normalized three microarray datasets and three simulated datasets with three methods: two common scale and location transformations and Lowess normalization. We first determined the optimal parameters for FCM clustering and found that the optimal fuzzification parameter for a microarray dataset depended on the normalization method applied during preprocessing. We then evaluated, on both the simulated and the microarray datasets, how normalizing noisy data affected the results of hard clustering and FCM clustering. A comparative analysis showed that the clustering results depended on the normalization method used and on the noisiness of the data. In particular, for datasets with large variations across samples, the choice of the fuzzification parameter for FCM was sensitive to the normalization method used.
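To make the role of the fuzzification parameter concrete, the following is a minimal sketch of the textbook FCM iteration, using only the Python standard library. It is not the authors' implementation: the function name `fuzzy_c_means`, the random initialization, the default `m=2.0`, and the stopping tolerance are all assumptions chosen for illustration. Each gene receives a membership degree in every cluster, and larger values of `m` make those memberships more diffuse.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means; an illustrative sketch, not the paper's setup.

    points: list of equal-length lists of floats (one row per gene).
    m: fuzzification parameter, must be greater than 1.
    Returns (centers, memberships). memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to 1.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Assumed initialization: random memberships, each row normalized to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Center update: mean of all points weighted by u_ik ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum
                 for d in range(dim)]
            )

        # Membership update: u_ik is proportional to d_ik ** (-2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            dists = [math.dist(points[i], c) for c in centers]
            if min(dists) == 0.0:
                # The point sits exactly on one or more centers: split the
                # membership evenly among those centers.
                hits = sum(1 for d in dists if d == 0.0)
                new_row = [1.0 / hits if d == 0.0 else 0.0 for d in dists]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        # Stop once no membership changed by more than tol.
        if shift < tol:
            break
    return centers, u
```

Running the function on the same data with several values of `m`, once per normalization method, is one way to observe the sensitivity described above. How the optimal value was actually selected in the study is not specified in this abstract.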
Lowess normalization is more robust for clustering genes from general microarray data than the two common scale and location adjustment methods when samples have varying expression patterns or are noisy. The FCM method slightly outperformed the hard clustering methods when the expression patterns of genes overlapped, and it was advantageous in finding co-regulated genes. Thus, the FCM approach offers a convenient method for finding subsets of genes that are strongly associated with a given cluster.
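One simple way to extract the genes strongly associated with a cluster is to threshold the fuzzy membership matrix. The sketch below assumes a matrix with one row per gene and one column per cluster, each row summing to 1. The helper name `core_members`, the cutoff of 0.7, and the example values are hypothetical and not taken from the study.

```python
def core_members(memberships, cluster, threshold=0.7):
    """Return indices of genes whose membership in `cluster` is >= threshold.

    memberships: list of rows, one per gene; row[k] is the membership degree
    of that gene in cluster k. The default cutoff of 0.7 is an assumption.
    """
    return [i for i, row in enumerate(memberships) if row[cluster] >= threshold]


# Hypothetical membership matrix for three genes and two clusters.
example = [
    [0.90, 0.10],  # clearly in cluster 0
    [0.55, 0.45],  # shared between both clusters
    [0.20, 0.80],  # clearly in cluster 1
]
print(core_members(example, cluster=0))  # [0]
print(core_members(example, cluster=1))  # [2]
```

The gene in the middle row falls below the cutoff for both clusters. A hard clustering method would force that gene into a single cluster, which is the overlap problem the fuzzy approach is meant to address.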