Lee George, Rodriguez Carlos, Madabhushi Anant
Department of Biomedical Engineering, Rutgers, The State University of New Jersey, 599 Taylor Road, Piscataway, NJ 08854, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2008 Jul-Sep;5(3):368-84. doi: 10.1109/TCBB.2008.36.
The recent explosion in the procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', which arises when the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a single dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. The motivation behind this work is to identify the appropriate DR methods for the analysis of high-dimensional gene- and protein-expression studies. To this end, we empirically and rigorously compare three nonlinear DR schemes (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
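The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. Everything specific here is an assumption for illustration and is not taken from the paper: the synthetic dataset from `make_classification` (standing in for an expression matrix with far more features than samples), the hypothetical settings `N_DIMS` and `N_NEIGHBORS`, the k-nearest-neighbor classifier, and 5-fold cross-validated accuracy as the measure of class discriminability. Laplacian Eigenmaps is represented by scikit-learn's `SpectralEmbedding`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an expression matrix: 60 samples, 500 features.
# (Assumption: real gene/protein data would be loaded here instead.)
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_classes=2, random_state=0)

N_DIMS = 3        # hypothetical target dimensionality
N_NEIGHBORS = 8   # hypothetical neighborhood size for graph-based methods

# Unsupervised schemes: two linear (PCA, MDS) and three nonlinear.
unsupervised = {
    "PCA": PCA(n_components=N_DIMS),
    "MDS": MDS(n_components=N_DIMS, random_state=0),
    "Isomap": Isomap(n_components=N_DIMS, n_neighbors=N_NEIGHBORS),
    "LLE": LocallyLinearEmbedding(n_components=N_DIMS,
                                  n_neighbors=N_NEIGHBORS, random_state=0),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=N_DIMS,
                                             n_neighbors=N_NEIGHBORS,
                                             random_state=0),
}


def score_embedding(Z, labels):
    """Mean 5-fold cross-validated k-NN accuracy in the embedded space."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, Z, labels, cv=5).mean()


results = {}
for name, reducer in unsupervised.items():
    # MDS and SpectralEmbedding offer no out-of-sample transform, so every
    # unsupervised method embeds all samples at once (labels are not used).
    Z = reducer.fit_transform(X)
    results[name] = score_embedding(Z, y)

# LDA is supervised, so it is fit inside each fold to avoid label leakage.
# With two classes it can produce at most one discriminant component.
lda_knn = make_pipeline(LinearDiscriminantAnalysis(n_components=1),
                        KNeighborsClassifier(n_neighbors=3))
results["LDA"] = cross_val_score(lda_knn, X, y, cv=5).mean()

for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20s}: {acc:.3f}")
```

The ranking printed on synthetic data says nothing about the paper's findings; the sketch only shows the shape of such a comparison, in which each scheme produces a low-dimensional embedding and the same classifier is then scored in every embedding.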