Institute of High Energy Physics, Austrian Academy of Sciences, Vienna, Austria.
BMC Bioinformatics. 2011 Aug 31;12:358. doi: 10.1186/1471-2105-12-358.
Clustering is a widely applicable pattern recognition method for discovering groups of similar observations in data. While there is a large variety of clustering algorithms, very few can enforce constraints on the variation of attributes for the data points included in a given cluster. In particular, a clustering algorithm that can limit variation within a cluster according to that cluster's position (centroid location) can produce effective and optimal results in many important applications, ranging from clustering of silicon pixels or calorimeter cells in high-energy physics to label-free liquid chromatography-based mass spectrometry (LC-MS) data analysis in proteomics and metabolomics.
We present MEDEA (M-Estimator with DEterministic Annealing), a new unsupervised, M-estimator-based algorithm designed to enforce position-specific constraints on variance during the clustering process. The utility of MEDEA is demonstrated by applying it to the problem of "peak matching" (identifying the common LC-MS peaks across multiple samples) in proteomic biomarker discovery. Using real-life datasets, we show that MEDEA not only outperforms current state-of-the-art model-based clustering methods, but also yields an implementation that is significantly more efficient, and hence applicable to much larger LC-MS datasets.
MEDEA is an effective and efficient solution to the problem of peak matching in label-free LC-MS data. The program implementing the MEDEA algorithm, including datasets, clustering results, and supplementary information, is available from the author's website at http://www.hephy.at/user/fru/medea/.
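To make the idea of a position-specific constraint concrete, the following is a minimal Python sketch. It is not the MEDEA algorithm: it uses neither an M-estimator nor deterministic annealing. It only illustrates, with a simple greedy one-dimensional pass, how the allowed spread of a cluster can depend on the cluster's centroid. The function names, the linear parts-per-million tolerance model, and the sample values are all assumptions made for this illustration and do not come from the paper.

```python
def position_tolerance(centroid, ppm=20.0):
    """Hypothetical constraint: allowed deviation grows linearly with position."""
    return centroid * ppm * 1e-6


def greedy_constrained_clusters(values, ppm=20.0):
    """Group sorted 1-D values; a value joins the current cluster only if it
    lies within the position-dependent tolerance of that cluster's centroid.
    Illustrative stand-in only, not the MEDEA procedure."""
    clusters = []
    for v in sorted(values):
        if clusters:
            current = clusters[-1]
            centroid = sum(current) / len(current)
            if abs(v - centroid) <= position_tolerance(centroid, ppm):
                current.append(v)
                continue
        # Either no cluster exists yet or the value violates the constraint.
        clusters.append([v])
    return clusters


# Made-up peak positions (for example, m/z values) from several samples.
peaks = [500.000, 500.004, 500.008, 1500.000, 1500.020, 1500.060]
clusters = greedy_constrained_clusters(peaks, ppm=20.0)
for c in clusters:
    print(len(c), c)
```

With these assumed values, a spacing of 0.02 is accepted near 1500 (tolerance about 0.03) but would be rejected near 500 (tolerance about 0.01), which is the kind of position-dependent behavior the abstract describes.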