Tseng George C
Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA.
Bioinformatics. 2007 Sep 1;23(17):2247-55. doi: 10.1093/bioinformatics/btm320. Epub 2007 Jun 27.
Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. Such data often contain many scattered objects that should not be clustered, and their presence has been found to hinder the performance of most traditional clustering algorithms in high-dimensional, complex settings. Additional prior knowledge from databases or previous experiments is also often available for the analysis. Excluding scattered objects and incorporating existing prior information are therefore desirable for improving clustering performance.
In this article, a class of loss functions is proposed for cluster analysis and applied to high-throughput genomic and proteomic data. Two major extensions of K-means are involved: penalization and weighting. The additive penalty term allows a set of scattered objects to remain unclustered. Weights are introduced to account for prior information about preferred or prohibited cluster patterns. The relationship of these loss functions to the classification likelihood of Gaussian mixture models is explored. Incorporating good prior information is also shown to alleviate the global optimization problem in clustering. Applications of the proposed method to simulated data and to high-throughput data sets from tandem mass spectrometry (MS/MS) and microarray experiments are presented. The results demonstrate superior performance over most existing methods, as well as computational simplicity and extensibility when applied to large, complex biological data sets.
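The penalization idea can be illustrated with a minimal sketch. It assumes squared Euclidean distance and a single penalty `lam` paid for each object left unclustered, so the loss is the within-cluster sum of squares plus `lam` times the number of scattered objects. The function name `penalized_kmeans` and the parameters `lam` and `init_centers` are hypothetical, the weighting for prior information is omitted, and this is not the author's software.

```python
import numpy as np


def _assign(X, centers, lam):
    """Assign each point to its nearest center, or to -1 if that costs more than lam."""
    # Squared Euclidean distance from every point to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    best = d2[np.arange(len(X)), nearest]
    # Leaving a point unclustered costs lam; clustering it costs its squared distance.
    labels = np.where(best > lam, -1, nearest)
    return labels, best


def penalized_kmeans(X, k, lam, init_centers=None, n_iter=100, seed=0):
    """Lloyd-style iterations for an assumed penalized K-means loss.

    Returns (labels, centers, loss); a label of -1 marks a scattered object.
    """
    X = np.asarray(X, dtype=float)
    if init_centers is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init_centers, dtype=float).copy()

    for _ in range(n_iter):
        labels, _ = _assign(X, centers, lam)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster becomes empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Recompute the assignment so labels and loss match the final centers.
    labels, best = _assign(X, centers, lam)
    loss = best[labels >= 0].sum() + lam * np.count_nonzero(labels == -1)
    return labels, centers, float(loss)
```

With a very large `lam` no object is ever cheaper to leave out, so the sketch reduces to ordinary K-means; lowering `lam` lets more objects fall into the scattered set.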
http://www.pitt.edu/~ctseng/research/software.html.
Supplementary data are available at Bioinformatics online.