Liu Jun, Mohammed Jaaved, Carter James, Ranka Sanjay, Kahveci Tamer, Baudis Michael
Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA.
Bioinformatics. 2006 Aug 15;22(16):1971-8. doi: 10.1093/bioinformatics/btl185. Epub 2006 May 16.
We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. We expect patients with the same cancer type to generally fall into the same cluster, because their underlying CGH profiles will be similar.
We focus on distance-based clustering strategies, which proceed in two steps: (1) the distances between all pairs of CGH samples are computed, and (2) the CGH samples are clustered based on these distances. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. The raw measure disregards correlation between contiguous genomic intervals and compares the aberrations in each genomic interval separately. The remaining two measures assume that consecutive genomic intervals may be correlated. Cosine maps a pair of CGH samples to vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures with three well-known clustering algorithms: bottom-up, top-down, and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the other two measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach.
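The two-step scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the exact definitions of the measures, so everything below is an assumption made for illustration: each sample is encoded as a list with one value per genomic interval (+1 for gain, -1 for loss, 0 for no aberration); `raw_similarity` counts intervals with a matching aberration; `segment_similarity` is a hypothetical stand-in for sim that counts maximal runs of contiguous intervals sharing the same aberration; the similarity-to-distance conversion `1 / (1 + s)` and the average-linkage rule in `bottom_up` are likewise choices made here, not taken from the paper.

```python
import math

# Assumed encoding (not from the paper): one value per genomic interval,
# +1 = gain, -1 = loss, 0 = no aberration.

def raw_similarity(a, b):
    """Count intervals where both samples carry the same aberration,
    treating every interval independently."""
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)

def cosine_similarity(a, b):
    """Cosine of the angle between the two samples viewed as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sample with no aberrations has no direction
    return dot / (norm_a * norm_b)

def segment_similarity(a, b):
    """Hypothetical stand-in for 'sim': count maximal runs of contiguous
    intervals in which both samples share the same aberration, so a long
    shared gain counts once rather than once per interval."""
    count = 0
    prev = 0
    for x, y in zip(a, b):
        shared = x if (x != 0 and x == y) else 0
        if shared != 0 and shared != prev:
            count += 1  # a new shared run starts here
        prev = shared
    return count

def pairwise_distances(samples, similarity):
    """Step 1: distance for every pair of samples. The conversion
    1 / (1 + s) is an assumption and requires s >= 0."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 / (1.0 + similarity(samples[i], samples[j]))
            dist[i][j] = dist[j][i] = d
    return dist

def bottom_up(dist, k):
    """Step 2: bottom-up (agglomerative) clustering with average linkage,
    merging the closest pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy data: four made-up samples over eight intervals.
samples = [
    [1, 1, 1, 0, -1, -1, 0, 0],
    [1, 1, 0, 0, -1, -1, 1, 0],
    [0, 0, 0, -1, 0, 0, 1, 1],
    [0, 0, -1, -1, 0, 0, 1, 1],
]
dist = pairwise_distances(samples, segment_similarity)
clusters = bottom_up(dist, k=2)
```

On the first two toy samples, the raw count rewards every matching interval, while the run-based count treats each contiguous shared aberration as a single event; this is the kind of difference between per-interval and correlation-aware measures that the abstract describes.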
All software developed in this article and all the datasets are available from the authors upon request.