Exploring high dimensional data with Butterfly: a novel classification algorithm based on discrete dynamical systems.

Author information

Department of Psychiatry, University Health Network, Toronto; Department of Pathology and Molecular Medicine, Queen's University, Kingston; Ontario Cancer Biomarker Network, Toronto; and Department of Biomedical and Molecular Sciences, Queen's University, Kingston, Ontario, Canada.

Publication information

Bioinformatics. 2014 Mar 1;30(5):712-8. doi: 10.1093/bioinformatics/btt602. Epub 2013 Oct 21.

Abstract

MOTIVATION

We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. The method provides a 2D representation of the relationships between subjects according to a set of variables, without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a class of discrete dynamical systems, collectively referred to as the chaos game, that are closely related to iterated function systems. The goal of the algorithm was to create a human-readable representation of high dimensional patient data that can reveal previously undetected subclusters of patients within anticipated classifications. When used with medical data, this provides a mechanism for a more personalized exploration of pathology. In clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after a feature selection filter and before a model evaluation (e.g. clustering accuracy) step. In the version given here, a univariate feature selection step is performed (in practice, more complex feature selection methods are used); a discrete dynamical system is driven by this reduced set of variables, which yields a set of 2D cluster models; these models are evaluated for their accuracy against a user-defined binary classification; and finally a visual representation of the top classification models is returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning, as the top performing models are returned in the protocol we describe here.
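
The memory-type mechanism mentioned above can be illustrated with the classic chaos game on the unit square. The following is only a minimal sketch and not the Butterfly implementation: the abstract does not say how variables are mapped to moves, so the four-symbol alphabet, the corner layout, the contraction ratio of 0.5 and the function names `chaos_game` and `quadrant` are all assumptions made for illustration.

```python
# Illustrative assumption: each symbol 0..3 selects one corner of the unit square.
CORNERS = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def chaos_game(symbols, start=(0.5, 0.5), ratio=0.5):
    """Return the trajectory of a chaos game driven by a symbol sequence."""
    x, y = start
    path = []
    for s in symbols:
        cx, cy = CORNERS[s]
        # Move a fixed fraction of the way toward the selected corner.
        x += ratio * (cx - x)
        y += ratio * (cy - y)
        path.append((x, y))
    return path


def quadrant(point):
    """Index of the corner whose quadrant of the unit square contains the point."""
    x, y = point
    if y < 0.5:
        return 0 if x < 0.5 else 1
    return 2 if x >= 0.5 else 3


if __name__ == "__main__":
    # Two sequences that differ early but share their last two symbols.
    a = chaos_game([0, 0, 2, 1])
    b = chaos_game([3, 3, 2, 1])
    print(a[-1], b[-1])
```

With a ratio of 0.5, every step halves the distance between any two trajectories that receive the same symbol, so the final position is dominated by the most recent symbols: the last symbol fixes the quadrant, the one before it fixes the sub-quadrant, and so on. Sequences with a shared recent history therefore land close together in the plane, which is the sense in which such a system "remembers" its input.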

RESULTS

Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and to provide a 2D representation of the relationships between subjects. We report results on three datasets (two in the article, one in the appendix), including a public lung cancer dataset that ships with the accompanying Butterfly R package. The included R script uses a univariate feature selection method for the dimension reduction step; in future work we intend to use a more powerful multivariate feature reduction method based on neural networks (Kriesel, 2007).
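
To make the univariate feature selection step concrete, here is a hedged sketch of one common approach: score each variable independently by Welch's t statistic between the two user-defined classes and keep the top k. The abstract does not state which univariate statistic the R script uses, so the choice of Welch's t, the 0/1 label encoding and the names `welch_t` and `select_features` are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (each needs at least two values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_features(rows, labels, k):
    """Return indices of the k features with the largest |t| between classes 0 and 1."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        group0 = [r[j] for r, y in zip(rows, labels) if y == 0]
        group1 = [r[j] for r, y in zip(rows, labels) if y == 1]
        scores.append((abs(welch_t(group0, group1)), j))
    # Highest score first; ties broken by feature index for reproducibility.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:k]]


if __name__ == "__main__":
    # Toy data: one row per subject, one column per variable.
    rows = [
        [1.0, 5.0, 0.1],
        [1.1, 4.0, 0.2],
        [0.9, 6.0, 0.1],
        [3.0, 5.1, 0.2],
        [3.1, 4.2, 0.1],
        [2.9, 5.9, 0.2],
    ]
    labels = [0, 0, 0, 1, 1, 1]
    print(select_features(rows, labels, k=2))
```

Because each variable is scored in isolation, a filter of this kind cannot detect variables that are informative only in combination, which is consistent with the authors' stated interest in a multivariate method for future work.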

AVAILABILITY AND IMPLEMENTATION

An R script implementing this algorithm (designed to run in RStudio) accompanies this article and is available at http://butterflygeraci.codeplex.com/. For details on the R package or for help installing the software, refer to the accompanying document, Supporting Material and Appendix.
