Clustering by soft-constraint affinity propagation: applications to gene-expression data.

Authors

Leone Michele, Weigt Martin

Affiliation

Institute for Scientific Interchange, Viale Settimio Severo 65, Villa Gualino, I-10133 Torino, Italy.

Publication information

Bioinformatics. 2007 Oct 15;23(20):2708-15. doi: 10.1093/bioinformatics/btm414. Epub 2007 Sep 25.

Abstract

MOTIVATION

Similarity-measure-based clustering is a central problem throughout scientific data analysis. Recently, Frey and Dueck (2007a) proposed a powerful new algorithm, Affinity Propagation (AP), based on message-passing techniques. In AP, each cluster is identified by a common exemplar to which all other data points of the same cluster refer, and exemplars must refer to themselves. Despite its proven power, AP in its present form suffers from a number of drawbacks. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly shaped clusters and leads to suboptimal performance, e.g. in analyzing gene expression data.
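The exemplar constraint described above can be made concrete with a small check. The sketch below is only an illustration, not the message-passing algorithm itself: the list encoding `exemplar_of` and the function name `unmet_exemplars` are assumptions introduced here, not names from the paper.

```python
def unmet_exemplars(exemplar_of):
    """Return points that others refer to but that do not refer to themselves.

    exemplar_of[i] is the index of the exemplar chosen by data point i
    (a hypothetical encoding used only for this illustration).
    """
    chosen = set(exemplar_of)
    return sorted(mu for mu in chosen if exemplar_of[mu] != mu)


# Points 0-2 share exemplar 0; point 3 is its own exemplar: consistent.
consistent = [0, 0, 0, 3]
# Point 2 refers to point 1, but point 1 refers to point 0: inconsistent.
inconsistent = [0, 0, 1, 3]

print(unmet_exemplars(consistent))    # []
print(unmet_exemplars(inconsistent))  # [1]
```

An assignment with an empty result satisfies the rule that every exemplar refers to itself; any index returned marks a point that is used as an exemplar without being one.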

RESULTS

This limitation can be overcome by relaxing the hard constraints of AP. A new parameter controls the importance of the constraints relative to the aim of maximizing the overall similarity, and allows interpolation between the simple case in which each data point selects its closest neighbor as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) is more informative and more accurate, and leads to more stable clustering. Even though a new a priori free parameter is introduced, the overall dependence of the algorithm on external tuning is reduced, as robustness is increased and an optimal strategy for parameter selection emerges more naturally. SCAP is tested on biological benchmark data, including in particular microarray data related to various cancer types. We show that the algorithm efficiently unveils the hierarchical cluster structure present in the data sets. Furthermore, it allows sparse gene expression signatures to be extracted for each cluster.
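One end of the interpolation mentioned above, where each data point simply selects its closest neighbor as an exemplar, is easy to sketch. The example below is a minimal illustration of that limiting case only and does not implement SCAP or its parameter. The function name, the one-dimensional toy data, and the use of negative squared distance as the similarity are all assumptions made here for illustration.

```python
def closest_neighbor_exemplars(points):
    """For each point, return the index of its most similar other point.

    Similarity is taken to be the negative squared distance between
    one-dimensional points (an assumption for this toy example).
    """
    def similarity(a, b):
        return -(a - b) ** 2

    exemplars = []
    for i, p in enumerate(points):
        # Candidates are all points except the point itself.
        candidates = [j for j in range(len(points)) if j != i]
        best = max(candidates, key=lambda j: similarity(p, points[j]))
        exemplars.append(best)
    return exemplars


# Two well-separated groups of toy values (hypothetical data).
points = [0.0, 1.0, 1.5, 10.0, 10.2]
print(closest_neighbor_exemplars(points))  # [1, 2, 1, 4, 3]
```

In this toy output no point refers to itself, so the choices maximize each point's similarity but ignore the exemplar constraint entirely, which is the unconstrained extreme that the abstract contrasts with the original AP.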
