Wong Dorothy S V, Wong Frederick K, Wood Graham R
Department of Statistics, Macquarie University, NSW 2109, Australia.
Bioinformatics. 2007 Apr 15;23(8):998-1005. doi: 10.1093/bioinformatics/btm053. Epub 2007 Feb 18.
Microarray experiments have revolutionized the study of gene expression with their ability to generate large amounts of data. This article describes an alternative to existing approaches to clustering gene expression profiles; the key idea is to cluster in stages using a hierarchy of distance measures. The method is motivated by the way the human mind sorts, and thereby groups, large numbers of items. The distance measures arise from an orthogonal decomposition of Euclidean distance, which yields a set of independent measures of different attributes of the gene expression profile. The interpretation of these distances is closely related to the statistical design of the microarray experiment. The clustering method not only accommodates missing data but also leads to an associated imputation method.
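To make the idea of an orthogonal decomposition of Euclidean distance concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' R code: the abstract does not say which components the method uses, so the split into a "level" part (difference in profile means) and a "shape" part (everything else) is assumed here, and the function name `decompose_distance` is hypothetical.

```python
def decompose_distance(x, y):
    """Split the squared Euclidean distance between two profiles into
    two orthogonal parts (an assumed, illustrative decomposition):
      level_sq: due to the difference in mean expression level
      shape_sq: due to differences remaining after removing the means
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    diff = [a - b for a, b in zip(x, y)]
    mean_diff = sum(diff) / n
    # Projection of the difference vector onto the constant vector (1, ..., 1).
    level_sq = n * mean_diff ** 2
    # Residual after removing that projection; orthogonal to the constant vector.
    shape_sq = sum((d - mean_diff) ** 2 for d in diff)
    return level_sq, shape_sq


# Two made-up expression profiles measured at four time points.
x = [1.0, 2.0, 4.0, 3.0]
y = [0.0, 2.0, 1.0, 1.0]

level_sq, shape_sq = decompose_distance(x, y)
total_sq = sum((a - b) ** 2 for a, b in zip(x, y))

# Because the two parts are orthogonal, they add up to the squared distance.
print(level_sq, shape_sq, total_sq)
```

Because the components sum exactly to the squared Euclidean distance, each can serve as a distance in its own right, which is what would allow clustering in stages: for example, grouping first on one component and then refining within groups on another. The order and number of stages are assumptions here, not details taken from the abstract.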
The performance of the clustering and imputation methods was tested on a simulated dataset, a yeast cell cycle dataset and a central nervous system development dataset. Based on the Rand and adjusted Rand indices, the clustering method agrees more closely with the biological classification of the data than commonly used clustering methods do. At varying levels of missingness, the imputation method outperforms most other imputation methods, as measured by root mean squared error (RMSE).
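The evaluation criteria named above can be sketched in a few lines of Python. This is an illustrative sketch using the standard textbook definitions of the Rand index, the adjusted Rand index and RMSE; the function names and the toy labels and values are made up for the example and are not taken from the study.

```python
import math
from collections import Counter
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance agreement (pair-counting form).

    Note: undefined (division by zero) when both partitions are trivial,
    e.g. both put every item in a single cluster.
    """
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(math.comb(c, 2) for c in joint.values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    maximum = 0.5 * (sum_a + sum_b)
    return (sum_joint - expected) / (maximum - expected)


def rmse(true_values, imputed_values):
    """Root mean squared error over the entries that were imputed."""
    squared = [(t - m) ** 2 for t, m in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: a reference classification and a clustering result for six genes.
reference = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]
ri = rand_index(reference, clustering)
ari = adjusted_rand_index(reference, clustering)

# Toy data: held-out true values and the values an imputation method filled in.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(ri, ari, error)
```

Higher Rand and adjusted Rand values indicate closer agreement with the reference classification, while lower RMSE indicates more accurate imputation.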
Code in R is available on request from the authors.