Department of Biomedical Engineering, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China.
Chin Med J (Engl). 2012 Sep;125(17):3048-52.
Clustering is a useful exploratory technique for interpreting gene expression data to reveal groups of genes sharing common functional attributes. Biologists frequently face the problem of choosing an appropriate algorithm. We aimed to provide a standalone, easily accessible and biologically oriented criterion for expression data clustering evaluation.
An external criterion utilizing annotation-based similarities between genes is proposed in this work. Gene Ontology information is employed as the annotation source. Six widely used clustering algorithms were compared over various types of gene expression data sets using the proposed criterion.
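The idea of an annotation-based external criterion can be illustrated with a minimal sketch. The abstract does not state the criterion's formula, so the code below assumes a simple stand-in: Jaccard similarity between the annotation term sets of two genes, averaged over all within-cluster gene pairs. The function names, gene names, and term labels (`T1`, `T2`, ...) are hypothetical placeholders, not the authors' method or real Gene Ontology identifiers.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of annotation terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotation_score(clusters, annotations):
    """Mean annotation similarity over all within-cluster gene pairs.

    Assumed stand-in for an external criterion: higher values mean that
    genes placed in the same cluster share more annotation terms.
    """
    total, pairs = 0.0, 0
    for members in clusters:
        for g1, g2 in combinations(members, 2):
            total += jaccard(annotations[g1], annotations[g2])
            pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy data: T1..T4 stand in for Gene Ontology term IDs.
annotations = {
    "geneA": {"T1", "T2"},
    "geneB": {"T1", "T2"},
    "geneC": {"T3"},
    "geneD": {"T3", "T4"},
}

# Two candidate partitions of the same four genes.
coherent = [["geneA", "geneB"], ["geneC", "geneD"]]
mixed = [["geneA", "geneC"], ["geneB", "geneD"]]

score_coherent = annotation_score(coherent, annotations)
score_mixed = annotation_score(mixed, annotations)
print(score_coherent, score_mixed)
```

Under this assumed score, the partition that groups genes with shared annotations ranks above the one that mixes them, which is the kind of ordering an external criterion is meant to produce when comparing clustering algorithms.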
The ranking of these algorithms given by the criterion coincides with common knowledge. Single-linkage performs significantly worse than the other algorithms, even worse than the random algorithm. Ward's method achieves the best performance in most cases.
The proposed criterion has a strong ability to distinguish among different clustering algorithms with different distance measures. It is also demonstrated that analyzing the main contributors to the criterion may offer some guidance in finding locally compact clusters. In addition, we suggest using Ward's algorithm for gene expression data analysis.
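One way to picture the analysis of main contributors is to break a score down per cluster and rank the clusters. The sketch below is an assumption for illustration only: it reuses mean within-cluster Jaccard similarity of annotation sets as the per-cluster contribution, which the abstract does not specify, and all gene names and term labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of annotation terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contributions(clusters, annotations):
    """Return (cluster_index, mean pairwise similarity), highest first.

    Clusters with fewer than two genes have no pairs and are skipped.
    """
    rows = []
    for idx, members in enumerate(clusters):
        sims = [
            jaccard(annotations[g1], annotations[g2])
            for g1, g2 in combinations(members, 2)
        ]
        if sims:
            rows.append((idx, sum(sims) / len(sims)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical toy data: T1..T3 stand in for annotation term IDs.
annotations = {
    "g1": {"T1"},
    "g2": {"T2"},
    "g3": {"T1", "T2"},
    "g4": {"T3"},
    "g5": {"T3"},
}
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]

ranked = cluster_contributions(clusters, annotations)
for idx, value in ranked:
    print(f"cluster {idx}: mean similarity {value:.3f}")
```

Clusters at the top of such a ranking are candidates for functionally coherent, compact groups; those at the bottom contribute little to the overall score.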