Minimum entropy clustering and applications to gene expression analysis.

Authors

Li Haifeng, Zhang Keshu, Jiang Tao

Affiliation

University of California at Riverside, 92521, USA.

Publication

Proc IEEE Comput Syst Bioinform Conf. 2004:142-51. doi: 10.1109/csb.2004.1332427.

Abstract

Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy criterion (measured on a posteriori probabilities), which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, when alpha = 2, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method. This is further evidence that the proposed criterion is good for clustering. Using a non-parametric approach to estimate the a posteriori probabilities, we then establish an efficient iterative algorithm to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of the adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, whereas our method can correctly reveal the structure of the data and effectively identify outliers simultaneously.
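
As an illustration of the criterion described in the abstract, the following minimal sketch evaluates the conditional entropy of clusters given the observations from a table of a posteriori probabilities, once with Shannon's entropy and once with the structural alpha-entropy. It is not the authors' algorithm. The function names, the empirical average over samples, and the normalization constant 1 / (2^(1 - alpha) - 1) (the common textbook form of the Havrda-Charvat entropy) are assumptions, and the paper may normalize differently.

```python
import math


def shannon_conditional_entropy(posteriors):
    """Empirical H(C|X): mean over samples of -sum_j p(c_j|x) ln p(c_j|x).

    `posteriors` is a list of rows; each row holds the a posteriori
    probabilities of every cluster for one observation (assumed layout).
    """
    total = 0.0
    for row in posteriors:
        # Terms with p == 0 contribute nothing (0 * ln 0 is taken as 0).
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(posteriors)


def structural_alpha_conditional_entropy(posteriors, alpha):
    """Empirical conditional structural alpha-entropy (assumed normalization).

    Per sample: (2**(1 - alpha) - 1)**-1 * (sum_j p(c_j|x)**alpha - 1).
    """
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    scale = 1.0 / (2.0 ** (1.0 - alpha) - 1.0)
    total = 0.0
    for row in posteriors:
        total += scale * (sum(p ** alpha for p in row) - 1.0)
    return total / len(posteriors)


# Hypothetical posteriors for two observations and two clusters.
confident = [[1.0, 0.0], [0.0, 1.0]]   # every point clearly assigned
uncertain = [[0.5, 0.5], [0.5, 0.5]]   # every point maximally ambiguous

print(shannon_conditional_entropy(confident))
print(shannon_conditional_entropy(uncertain))
print(structural_alpha_conditional_entropy(uncertain, 2.0))
```

Both measures are zero for the confident assignment and largest for the ambiguous one, which is why minimizing them favors crisp partitions. Under the normalization assumed here, the alpha = 2 value for a single observation is 2(1 - Σ p²), a constant multiple of the 1 - Σ p² expression that appears in the asymptotic nearest neighbor error.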
