Fuzzy ensemble clustering based on random projections for DNA microarray data analysis.

Author information

Avogadri Roberto, Valentini Giorgio

Affiliation

DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy.

Publication information

Artif Intell Med. 2009 Feb-Mar;45(2-3):173-83. doi: 10.1016/j.artmed.2008.07.014. Epub 2008 Sep 17.

Abstract

OBJECTIVE

Two major problems in the unsupervised analysis of gene expression data are the accuracy and reliability of the discovered clusters, and the biological fact that the boundaries between classes of patients or classes of functionally related genes are sometimes not clearly defined. The main goal of this work is to explore new strategies and develop new clustering methods that improve the accuracy and robustness of clustering results, taking into account the uncertainty underlying the assignment of examples to clusters in the context of gene expression data analysis.

METHODOLOGY

We propose a fuzzy ensemble clustering approach both to improve the accuracy of clustering results and to take into account the inherent fuzziness of biological and biomedical gene expression data. We apply random projections that obey the Johnson-Lindenstrauss lemma to obtain several lower-dimensional instances of the original high-dimensional gene expression data, approximately preserving the information and the metric structure of the original data. We then adopt a double fuzzy approach to obtain a consensus ensemble clustering: first we apply a fuzzy k-means algorithm to the different instances of the projected low-dimensional data, and then we use a fuzzy t-norm to combine the multiple clusterings. Several variants of the fuzzy ensemble clustering algorithm are proposed, which differ in the technique used to combine the base clusterings and to obtain the final consensus clustering.
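
The pipeline described above (random projection, fuzzy k-means on each projection, t-norm combination, consensus) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the formulas, so the following are all assumptions made for illustration: the scaled Gaussian projection matrix, the fuzzifier m = 2, the co-membership rule that sums a t-norm over clusters, and the final step that runs fuzzy k-means on the rows of the averaged similarity matrix. The function names are hypothetical.

```python
import numpy as np

def random_projection(X, d_out, rng):
    """Project the rows of X to d_out dimensions with a scaled Gaussian matrix
    (assumed here as one Johnson-Lindenstrauss-style projection)."""
    R = rng.normal(size=(X.shape[1], d_out)) / np.sqrt(d_out)
    return X @ R

def fuzzy_kmeans(X, k, m=2.0, n_iter=200, tol=1e-6, rng=None):
    """Plain fuzzy k-means; returns an (n, k) membership matrix with rows summing to 1."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.dirichlet(np.ones(k), size=X.shape[0])  # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centroids
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = np.maximum(dist, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # standard membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U

def tnorm_similarity(U, tnorm="product"):
    """Pairwise fuzzy co-membership: sum over clusters of T(U[i, c], U[j, c]).
    This combination rule is an assumption, not taken from the abstract."""
    if tnorm == "product":  # algebraic product t-norm
        return U @ U.T
    if tnorm == "min":      # minimum t-norm
        return np.minimum(U[:, None, :], U[None, :, :]).sum(axis=2)
    raise ValueError(f"unknown t-norm: {tnorm}")

def fuzzy_ensemble(X, k, d_out, n_proj=10, tnorm="product", seed=0):
    """Average the t-norm similarity matrices of n_proj base clusterings,
    then cluster the rows of the averaged matrix (assumed consensus step)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M = np.zeros((n, n))
    for _ in range(n_proj):
        U = fuzzy_kmeans(random_projection(X, d_out, rng), k, rng=rng)
        M += tnorm_similarity(U, tnorm)
    M /= n_proj
    return fuzzy_kmeans(M, k, rng=rng), M
```

Calling `fuzzy_ensemble(X, k=2, d_out=20)` on a samples-by-genes matrix `X` returns the consensus membership matrix and the averaged similarity matrix. Working with pairwise similarities avoids having to match cluster labels across base clusterings, which is why this sketch combines the clusterings that way.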

RESULTS AND CONCLUSION

We applied the proposed fuzzy ensemble methods to the gene expression analysis of leukemia, lymphoma, adenocarcinoma and melanoma patients, and compared the results with other state-of-the-art ensemble methods. The results show that in some cases, taking into account the natural fuzziness of the data improves the discovery of classes of patients defined at the bio-molecular level. The dimensionality reduction achieved through random projection techniques is well suited to the characteristics of high-dimensional gene expression data, resulting in improved performance with respect to both single fuzzy k-means and ensemble methods based on resampling techniques. Moreover, we show that analyzing the accuracy and diversity of the base fuzzy clusterings can help explain the advantages and the limitations of the proposed fuzzy ensemble approach.
