Song Joon Jin, Ren Yuan, Yan Fenglan
Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
Comput Biol Chem. 2009 Oct;33(5):408-13. doi: 10.1016/j.compbiolchem.2009.07.017. Epub 2009 Aug 18.
High-throughput data are widely used in biological and medical studies to discover gene and protein functions. Because such data are high-dimensional, principal component analysis (PCA) is often used for dimension reduction. When a few principal components (PCs) are selected for dimension reduction or considered for dimension determination, they are typically ranked by their variances, that is, their eigenvalues. This ranking is not always effective in subsequent multivariate analysis, particularly classification. To maximize the information retained by a subset of the components, we apply a different ranking criterion, the canonical variate criterion, which considers within- and between-group variance rather than the total variance used by the classical criterion. Four prevalent classification methods are considered and compared using leave-one-out cross-validation. The methods are illustrated with three real high-throughput data sets: two microarray data sets and one nuclear magnetic resonance spectra data set.
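The contrast between the two rankings can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the canonical variate criterion, so the code assumes it is the per-component ratio of between-group to within-group sum of squares of the PC scores. The synthetic data, the nearest-centroid classifier, and the function names `between_within_ratio` and `loocv_accuracy` are all hypothetical choices made for illustration; the four classification methods compared in the paper are not named in the abstract.

```python
import numpy as np

def between_within_ratio(scores, y):
    """Per-component ratio of between-group to within-group sum of squares.

    Assumed form of the canonical variate criterion (not taken from the paper).
    """
    grand = scores.mean(axis=0)
    between = np.zeros(scores.shape[1])
    within = np.zeros(scores.shape[1])
    for g in np.unique(y):
        Zg = scores[y == g]
        mg = Zg.mean(axis=0)
        between += len(Zg) * (mg - grand) ** 2
        within += ((Zg - mg) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

def pca_scores(X):
    """Return the mean, PC scores, and loadings, dropping null components."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    keep = s > 1e-10 * s[0]          # centering leaves one zero singular value
    return mean, (U * s)[:, keep], Vt[keep]

def loocv_accuracy(X, y, k, criterion):
    """LOOCV accuracy of a nearest-centroid classifier on k selected PCs.

    PCA and the ranking are recomputed inside every fold so the held-out
    sample never influences component selection.
    """
    n = X.shape[0]
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        Xtr, ytr = X[train], y[train]
        mean, scores, Vt = pca_scores(Xtr)
        if criterion == "variance":
            idx = np.arange(k)       # SVD already orders PCs by variance
        else:
            idx = np.argsort(between_within_ratio(scores, ytr))[::-1][:k]
        Ztr = scores[:, idx]
        zte = (X[i] - mean) @ Vt[idx].T
        labels = np.unique(ytr)
        centroids = np.array([Ztr[ytr == g].mean(axis=0) for g in labels])
        pred = labels[np.argmin(((centroids - zte) ** 2).sum(axis=1))]
        hits += int(pred == y[i])
    return hits / n

# Synthetic data (hypothetical): more features than samples, two groups.
rng = np.random.default_rng(0)
n, p = 40, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(scale=0.5, size=(n, p))
X[:, 0] += rng.normal(scale=10.0, size=n)   # high-variance nuisance feature
X[:, 1] += np.where(y == 1, 3.0, -3.0)      # lower-variance class signal

_, scores, _ = pca_scores(X)
ratio = between_within_ratio(scores, y)
top_by_ratio = int(np.argmax(ratio))        # PC 0 is the top PC by variance

acc_var = loocv_accuracy(X, y, k=1, criterion="variance")
acc_cv = loocv_accuracy(X, y, k=1, criterion="canonical")
print("top PC by variance: 0, top PC by between/within ratio:", top_by_ratio)
print("LOOCV accuracy, variance ranking:", acc_var)
print("LOOCV accuracy, canonical variate ranking:", acc_cv)
```

In this constructed example the largest-variance component mostly captures a feature unrelated to the groups, so keeping a single component by eigenvalue can discard the class signal that the between/within ranking retains. How large the gap is on real microarray or NMR data depends on the data set and the classifier.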