Smolkin Mark, Ghosh Debashis
Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.
BMC Bioinformatics. 2003 Sep 6;4:36. doi: 10.1186/1471-2105-4-36.
A potential benefit of profiling tissue samples with microarrays is the generation of molecular fingerprints that define subtypes of disease. Hierarchical clustering has been the primary analytical tool used to define disease subtypes from microarray experiments in cancer settings. Assessing cluster reliability is a major complication in analyzing the output of clustering procedures. While most work has focused on estimating the number of clusters in a dataset, the stability of individual clusters has not been addressed.
We address this problem by developing cluster stability scores using subsampling techniques. These scores exploit the redundancy in biologically discriminatory information on the chip. Our approach is generic and can be used with any clustering method. We propose procedures for calculating cluster stability scores for situations involving both known and unknown numbers of clusters. We also develop cluster-size adjusted stability scores. The method is illustrated by application to data from three cancer studies: one involving childhood cancers, a second involving B-cell lymphoma, and a third from a malignant melanoma study.
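The idea of a subsampling-based stability score for the known-number-of-clusters case can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that genes (columns) are the units being subsampled, that average-linkage hierarchical clustering with Euclidean distance is used, and that a cluster counts as "recovered" only when exactly the same set of samples reappears. The function names `cluster_sets` and `stability_scores`, and the parameters `gene_fraction` and `n_subsamples`, are hypothetical names chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_sets(data, k):
    """Cluster the rows (samples) of `data` into at most k groups.

    Returns a set of frozensets, each holding the sample indices of one cluster.
    Average linkage and Euclidean distance are assumptions of this sketch.
    """
    tree = linkage(data, method="average", metric="euclidean")
    labels = fcluster(tree, t=k, criterion="maxclust")
    return {
        frozenset(np.flatnonzero(labels == c).tolist())
        for c in np.unique(labels)
    }


def stability_scores(data, k, n_subsamples=100, gene_fraction=0.8, seed=0):
    """Score each cluster found on the full data by how often it reappears.

    data: array of shape (n_samples, n_genes).
    For each of `n_subsamples` rounds, a random subset of genes is drawn
    without replacement, the samples are re-clustered, and each original
    cluster is counted as recovered if the identical sample set appears.
    Returns a dict mapping each original cluster to its recovery fraction.
    """
    rng = np.random.default_rng(seed)
    n_genes = data.shape[1]
    n_keep = max(1, int(round(gene_fraction * n_genes)))

    original = cluster_sets(data, k)
    counts = {cluster: 0 for cluster in original}

    for _ in range(n_subsamples):
        genes = rng.choice(n_genes, size=n_keep, replace=False)
        resampled = cluster_sets(data[:, genes], k)
        for cluster in original:
            if cluster in resampled:
                counts[cluster] += 1

    return {cluster: counts[cluster] / n_subsamples for cluster in original}
```

A score near 1 means the cluster reappears under almost every gene subsample; a score near 0 suggests the cluster depends on a small set of genes. Because the clustering step is isolated in `cluster_sets`, another clustering method could be substituted there. The cluster-size adjustment and the unknown-number-of-clusters procedure are not sketched here.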
Code implementing the proposed analytic method can be obtained at the second author's website.