Kapp Amy V, Tibshirani Robert
Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA.
Biostatistics. 2007 Jan;8(1):9-31. doi: 10.1093/biostatistics/kxj029. Epub 2006 Apr 12.
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure, all of which use the IGP but differ in the way the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro", available through The Comprehensive R Archive Network (http://cran.r-project.org).
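The abstract does not spell out how the IGP is computed, so the following is only a minimal sketch under stated assumptions: that each new sample is classified to the previously defined cluster whose centroid it correlates with most strongly, and that the IGP of a cluster is the proportion of samples classified to it whose nearest neighbor (here, the most correlated other sample) is classified to the same cluster. The function names, the use of Pearson correlation, and the toy data are illustrative choices, not the clusterRepro API, and the null-distribution step of the validation procedure is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / (sx * sy)


def classify_to_centroids(samples, centroids):
    """Assign each sample to the index of its most correlated centroid."""
    return [
        max(range(len(centroids)), key=lambda k: pearson(s, centroids[k]))
        for s in samples
    ]


def in_group_proportion(samples, labels):
    """Per-cluster share of samples whose nearest neighbor has the same label.

    Requires at least two samples, since each sample needs a neighbor.
    """
    counts, hits = {}, {}
    for i, s in enumerate(samples):
        # Nearest neighbor = the other sample with the highest correlation.
        nn = max(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: pearson(s, samples[j]),
        )
        counts[labels[i]] = counts.get(labels[i], 0) + 1
        if labels[nn] == labels[i]:
            hits[labels[i]] = hits.get(labels[i], 0) + 1
    return {k: hits.get(k, 0) / counts[k] for k in counts}


# Hypothetical centroids from the original dataset (one row per cluster).
centroids = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
# Hypothetical samples from the independent dataset.
new_samples = [
    [1.1, 2.0, 2.9, 4.2],
    [0.9, 2.2, 3.1, 3.8],
    [4.1, 3.0, 2.2, 0.8],
    [3.9, 2.8, 1.9, 1.2],
]

labels = classify_to_centroids(new_samples, centroids)
igp = in_group_proportion(new_samples, labels)
print(labels, igp)
```

In this toy case the two groups are cleanly separated, so every sample's nearest neighbor shares its label. In the validation procedure described above, an observed IGP would then be compared against a null distribution of IGP values, and the four versions of the procedure differ in how that null distribution is generated.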