Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data.

Author information

Tseng George C

Affiliation

Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA.

Publication information

Bioinformatics. 2007 Sep 1;23(17):2247-55. doi: 10.1093/bioinformatics/btm320. Epub 2007 Jun 27.

DOI:10.1093/bioinformatics/btm320

PMID:17597097

Abstract

MOTIVATION

Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. In such high-dimensional, complex settings, the presence of many scattered objects that should not be clustered has been found to hinder the performance of most traditional clustering algorithms. Very often, additional prior knowledge from databases or previous experiments is also available for the analysis. Excluding scattered objects and incorporating existing prior information are therefore desirable to enhance clustering performance.

RESULTS

In this article, a class of loss functions is proposed for cluster analysis and applied to high-throughput genomic and proteomic data. The loss functions extend K-means in two ways: penalization and weighting. An additive penalty term allows a set of scattered objects to remain unclustered. Weights are introduced to account for prior information about preferred or prohibited cluster patterns. The relationship of these loss functions to the classification likelihood of Gaussian mixture models is explored. Incorporating good prior information is also shown to alleviate the global optimization problem in clustering. Applications of the proposed method to simulated data and to high-throughput data sets from tandem mass spectrometry (MS/MS) and microarray experiments are presented. The results demonstrate its superior performance over most existing methods, as well as its computational simplicity and extensibility when applied to large, complex biological data sets.
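The two extensions can be illustrated with a minimal sketch. The abstract does not state the loss function, so the form used here is an assumption: the weighted within-cluster squared Euclidean distance plus a fixed penalty `lam` for every object left unclustered. The function name `penalized_weighted_kmeans`, the per-object `weights` vector (a simplification of weights derived from prior information), and the `init_centers` argument are all hypothetical and are not taken from the author's software.

```python
import numpy as np

def penalized_weighted_kmeans(X, k, lam, weights=None, init_centers=None,
                              n_iter=100, seed=0):
    """Illustrative sketch only (assumed loss, not the published implementation).

    Lowers  sum_{clustered i} w_i * ||x_i - mu_c(i)||^2  +  lam * (# scattered objects)
    by alternating assignment and centre updates. Weights are assumed positive.
    Returns (labels, centers, loss); label -1 marks a scattered object.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    if init_centers is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centers = np.asarray(init_centers, dtype=float).copy()

    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every object to every centre, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        cost = w * d2[np.arange(n), nearest]
        # An object is left unclustered when joining its nearest cluster
        # would cost more than paying the penalty lam.
        new_labels = np.where(cost <= lam, nearest, -1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # The weighted mean minimizes the weighted squared distance within a cluster.
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = np.average(X[members], axis=0, weights=w[members])

    clustered = labels >= 0
    d2_final = ((X[clustered] - centers[labels[clustered]]) ** 2).sum(axis=1)
    loss = float((w[clustered] * d2_final).sum() + lam * (~clustered).sum())
    return labels, centers, loss


# Two tight groups plus one far-away object that should stay unclustered.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, -50]]
labels, centers, loss = penalized_weighted_kmeans(
    X, k=2, lam=5.0, init_centers=[[0, 0], [10, 10]])
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

In this sketch a larger `lam` forces more objects into clusters, and `lam` large enough recovers ordinary (weighted) K-means. As with K-means, the result depends on the starting centres, which is where good prior information could help.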

AVAILABILITY

http://www.pitt.edu/~ctseng/research/software.html.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
