University of São Paulo, São Carlos.
Federal University of Pernambuco, Recife and Aachen University Medical School, RWTH Aachen.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Jul-Aug;10(4):845-57. doi: 10.1109/TCBB.2013.9.
Cluster analysis is usually the first step taken to extract information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance for achieving satisfactory clustering results. Nevertheless, to date there are no comprehensive guidelines on how to choose proximity measures for clustering microarray data. Pearson correlation is the most widely used proximity measure, whereas the characteristics of other measures remain largely unexplored. In this paper, we investigate the choice of proximity measures for clustering microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time-course and cancer experiments. Our results indicate that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Because different measures stood out in the time-course and cancer evaluations, the choice of measure should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature into a benchmark and introduce a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
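To make concrete why the choice of proximity measure matters, the following minimal sketch compares three of the measures named above (Pearson, Spearman, and Euclidean distance) on toy expression profiles. It is not the evaluation procedure or the IBSA methodology from the paper; the function names, the `1 - correlation` conversion from similarity to distance, and the data values are assumptions made purely for illustration.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed similarity-to-distance conversion).
    Undefined for constant profiles (zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_distance(x, y):
    """1 - Spearman correlation, i.e. Pearson distance on the ranks."""
    return pearson_distance(ranks(x), ranks(y))

def euclidean_distance(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(target, candidates, dist):
    """Name of the candidate profile closest to target under dist."""
    return min(candidates, key=lambda name: dist(target, candidates[name]))

# Hypothetical expression profiles over five time points.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]   # same shape as gene_a, larger scale
gene_c = [1.5, 2.5, 2.0, 4.5, 4.0]        # similar magnitude, different shape
others = {"gene_b": gene_b, "gene_c": gene_c}

for label, dist in [("pearson", pearson_distance),
                    ("spearman", spearman_distance),
                    ("euclidean", euclidean_distance)]:
    print(label, nearest(gene_a, others, dist))
```

On this toy data the correlation-based measures pair `gene_a` with `gene_b` (identical shape), while Euclidean distance pairs it with `gene_c` (similar magnitude), so a clustering algorithm fed these distances could group the same genes differently depending on the measure.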