Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies.

Authors

Lee George, Rodriguez Carlos, Madabhushi Anant

Affiliation

Department of Biomedical Engineering, Rutgers, The State University of New Jersey, 599 Taylor Road, Piscataway, NJ 08854, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2008 Jul-Sep;5(3):368-84. doi: 10.1109/TCBB.2008.36.

DOI:10.1109/TCBB.2008.36

PMID:18670041

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2562675/

Abstract

The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
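
The contrast the abstract draws between Euclidean distance and underlying nonlinear structure can be illustrated with a small toy sketch. It is not the authors' experiment or code: the semicircle data, the neighborhood size `k=2`, and the helper names `knn_graph` and `geodesic_distances` are all assumptions made for illustration. The sketch estimates distance along a curved manifold by shortest paths over a nearest-neighbor graph, which is the idea underlying graph-based methods such as Isomap, and compares it with the straight-line distance between the two ends of the curve.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph weighted by Euclidean distance."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Sort all other points by distance to point i and keep the k closest.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source` over the neighbor graph."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best


# Synthetic data (an assumption for illustration): 50 points on a unit semicircle.
n_points = 50
points = [
    (math.cos(math.pi * t / (n_points - 1)), math.sin(math.pi * t / (n_points - 1)))
    for t in range(n_points)
]

graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, source=0)

euclidean_end_to_end = math.dist(points[0], points[-1])
geodesic_end_to_end = geo[n_points - 1]

print(f"Euclidean distance between the two ends: {euclidean_end_to_end:.3f}")
print(f"Graph-based geodesic distance:           {geodesic_end_to_end:.3f}")
```

On this toy curve the straight-line distance between the ends is about 2, while the graph-based estimate is close to the arc length π. A method that relies only on Euclidean distance would therefore treat the two ends as more similar than they are along the curve.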