Exploring high dimensional data with Butterfly: a novel classification algorithm based on discrete dynamical systems.

Department of Psychiatry, University Health Network, Toronto, Department of Pathology and Molecular Medicine, Queen's University, Kingston, Ontario Cancer Biomarker Network, Toronto and Department of Biomedical and Molecular Sciences, Queen's University, Kingston, Ontario, Canada.

Bioinformatics. 2014 Mar 1;30(5):712-8. doi: 10.1093/bioinformatics/btt602. Epub 2013 Oct 21.

MOTIVATION

We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. This method provides a 2D representation of the relationship between subjects according to a set of variables without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a certain class of discrete dynamical systems, collectively referred to as the chaos game, that are closely related to iterated function systems. The goal of the algorithm was to create a human-readable representation of high dimensional patient data that is capable of detecting unrevealed subclusters of patients within anticipated classifications. When used with medical data, this provides a mechanism for a more personalized exploration of pathology. For clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after a feature selection filter and before a model evaluation (e.g. clustering accuracy) protocol. In the version given here, a univariate feature selection step is performed (in practice more complex feature selection methods are used), a discrete dynamical system is driven by this reduced set of variables (which results in a set of 2D cluster models), these models are evaluated for their accuracy (according to a user-defined binary classification) and finally a visual representation of the top classification models is returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning, as the top performing models are returned in the protocol we describe here.
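The memory-type mechanism of the chaos game can be illustrated with a minimal Python sketch. This is a generic chaos game, not the Butterfly implementation: the function names, the equal-width binning in `discretize`, the regular-polygon vertex layout and the contraction `ratio` of 0.5 are all assumptions made for illustration. Each subject's selected variables are turned into a sequence of symbols, and each symbol pulls the current 2D point part of the way toward the vertex assigned to it, so the final position encodes the recent history of the sequence.

```python
import math


def polygon_vertices(k):
    """Return k vertices of a regular polygon on the unit circle (assumed layout)."""
    return [
        (math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
        for i in range(k)
    ]


def discretize(values, k):
    """Map each value in [0, 1] to one of k symbols (hypothetical equal-width binning)."""
    return [min(int(v * k), k - 1) for v in values]


def chaos_game_point(symbols, vertices, ratio=0.5, start=(0.0, 0.0)):
    """Iterate the chaos game: move a fraction `ratio` toward the vertex of each symbol."""
    x, y = start
    for s in symbols:
        vx, vy = vertices[s]
        x = x + ratio * (vx - x)
        y = y + ratio * (vy - y)
    return (x, y)


if __name__ == "__main__":
    vertices = polygon_vertices(4)
    # Two hypothetical subjects described by five variables scaled to [0, 1].
    subject_a = [0.30, 0.60, 0.10, 0.05, 0.20]
    subject_b = [0.90, 0.80, 0.10, 0.05, 0.20]
    for name, values in (("a", subject_a), ("b", subject_b)):
        point = chaos_game_point(discretize(values, 4), vertices)
        print(name, point)
```

Because every step halves the distance between any two trajectories that receive the same symbol, subjects whose sequences end in the same symbols land close together in the plane, which is the property a 2D cluster model of this kind would rely on.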

RESULTS

Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and provide a 2D representation of the relationship between subjects. We report results on three datasets (two in the article; one in the appendix) including a public lung cancer dataset that comes along with the included Butterfly R package. In the included R script, a univariate feature selection method is used for the dimension reduction step, but in the future we wish to use a more powerful multivariate feature reduction method based on neural networks (Kriesel, 2007).
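As a rough illustration of what a univariate feature selection step can look like, the following Python sketch ranks each variable independently by the absolute value of a Welch t-statistic between two classes and keeps the top `m`. The abstract does not state which univariate statistic the included R script uses, so the choice of statistic, the function names `welch_t` and `top_features`, and the 0/1 label encoding are assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples; returns 0.0 if both samples are constant."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(rows, labels, m):
    """Return the indices of the m features with the largest |t| between classes 0 and 1."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), j))
    # Sort by descending score, breaking ties by feature index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:m]]


if __name__ == "__main__":
    # Hypothetical toy data: four subjects, three variables, binary labels.
    rows = [
        [1.0, 0.1, 5.0],
        [1.1, 0.2, 4.0],
        [0.9, 0.9, 5.5],
        [1.0, 1.0, 4.5],
    ]
    labels = [0, 0, 1, 1]
    print(top_features(rows, labels, 2))
```

Each variable is scored on its own, which is what makes the filter univariate; a multivariate method would instead score combinations of variables jointly.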

AVAILABILITY AND IMPLEMENTATION

A script written in R (designed to run in RStudio) that implements this algorithm accompanies this article and is available at http://butterflygeraci.codeplex.com/. For details on the R package or for help installing the software, refer to the accompanying document, Supporting Material and Appendix.


Similar articles

Exploring high dimensional data with Butterfly: a novel classification algorithm based on discrete dynamical systems.

Bioinformatics. 2014 Mar 1;30(5):712-8. doi: 10.1093/bioinformatics/btt602. Epub 2013 Oct 21.

Feature selection and nearest centroid classification for protein mass spectrometry.

BMC Bioinformatics. 2005 Mar 23;6:68. doi: 10.1186/1471-2105-6-68.

Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification

2D-EM clustering approach for high-dimensional data through folding feature vectors.

BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):547. doi: 10.1186/s12859-017-1970-8.

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data.

BMC Genomics. 2020 Sep 22;21(1):650. doi: 10.1186/s12864-020-07038-3.

A combinational feature selection and ensemble neural network method for classification of gene expression data.

BMC Bioinformatics. 2004 Sep 27;5:136. doi: 10.1186/1471-2105-5-136.

Genetic test bed for feature selection.

Bioinformatics. 2006 Apr 1;22(7):837-42. doi: 10.1093/bioinformatics/btl008. Epub 2006 Jan 20.

caBIG VISDA: modeling, visualization, and discovery for cluster analysis of genomic data.

BMC Bioinformatics. 2008 Sep 18;9:383. doi: 10.1186/1471-2105-9-383.

penalizedSVM: a R-package for feature selection SVM classification.

Bioinformatics. 2009 Jul 1;25(13):1711-2. doi: 10.1093/bioinformatics/btp286. Epub 2009 Apr 27.

Targeted projection pursuit for visualizing gene expression data classifications.

Bioinformatics. 2006 Nov 1;22(21):2667-73. doi: 10.1093/bioinformatics/btl463. Epub 2006 Sep 5.

Cited by

Dual blockade of IL-10 and PD-1 leads to control of SIV viral rebound following analytical treatment interruption.

Nat Immunol. 2024 Oct;25(10):1900-1912. doi: 10.1038/s41590-024-01952-4. Epub 2024 Sep 12.

Ensemble Merit Merge Feature Selection for Enhanced Multinomial Classification in Alzheimer's Dementia.

Comput Math Methods Med. 2015;2015:676129. doi: 10.1155/2015/676129. Epub 2015 Oct 20.

Distance-based classifiers as potential diagnostic and prediction tools for human diseases.

BMC Genomics. 2014;15 Suppl 12(Suppl 12):S10. doi: 10.1186/1471-2164-15-S12-S10. Epub 2014 Dec 19.

A composite model for subgroup identification and prediction via bicluster analysis.

PLoS One. 2014 Oct 27;9(10):e111318. doi: 10.1371/journal.pone.0111318. eCollection 2014.
