Pamukçu Esra, Bozdogan Hamparsum, Çalık Sinan
Department of Statistics, Faculty of Science, Firat University, 23119 Elazig, Turkey.
Department of Business Analytics and Statistics, The University of Tennessee, Knoxville, TN 37996, USA.
Comput Math Methods Med. 2015;2015:370640. doi: 10.1155/2015/370640. Epub 2015 Feb 19.
Gene expression data are typically large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although classical principal component analysis (PCA) is widely used as a standard first step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings when data sets involve undersized samples, because the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach that uses the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to retain, we further introduce and develop Akaike's information criterion (AIC), the consistent Akaike's information criterion (CAIC), and the information-theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach. Because the hybridized smoothed covariance matrix estimators do not degenerate, they can be used to perform PPCA, reduce the dimension, and carry out supervised classification of cancer groups in high dimensions.
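The singularity problem and the role of a smoothed covariance estimator can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the function names are hypothetical, the data are simulated, and the identity-target convex combination with an arbitrary weight `alpha` is a generic stand-in, not the maximum entropy or hybridized estimators of the paper. The criteria use the standard closed-form PPCA maximum likelihood; ICOMP is omitted.

```python
import numpy as np

def sample_covariance(X):
    """Maximum likelihood sample covariance (divides by n)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def smoothed_covariance(X, alpha=0.1):
    """Generic stand-in smoother (assumption, not the paper's estimator):
    a convex combination of the sample covariance and a scaled identity."""
    S = sample_covariance(X)
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target

def ppca_criteria(cov, n, k):
    """AIC and CAIC for a PPCA model with k components, computed from the
    eigenvalues of the covariance estimate via the closed-form ML solution."""
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    p = eigvals.size
    sigma2 = eigvals[k:].mean()  # ML noise variance: mean of discarded eigenvalues
    loglik = -0.5 * n * (
        p * np.log(2.0 * np.pi)
        + np.log(eigvals[:k]).sum()
        + (p - k) * np.log(sigma2)
        + p
    )
    # Free parameters: loadings (up to rotation), noise variance, and mean.
    m = p * k - k * (k - 1) // 2 + 1 + p
    aic = -2.0 * loglik + 2.0 * m
    caic = -2.0 * loglik + m * (np.log(n) + 1.0)
    return aic, caic

# Simulated undersized sample: n = 20 observations, p = 50 features,
# with three latent factors plus unit-variance noise.
rng = np.random.default_rng(0)
n, p = 20, 50
Z = rng.normal(size=(n, 3))
W = rng.normal(size=(p, 3))
X = Z @ W.T + rng.normal(size=(n, p))

S_sample = sample_covariance(X)    # rank at most n - 1, hence singular
S_smooth = smoothed_covariance(X)  # all eigenvalues strictly positive

scores = {k: ppca_criteria(S_smooth, n, k) for k in range(1, 11)}
best_k_aic = min(scores, key=lambda k: scores[k][0])
best_k_caic = min(scores, key=lambda k: scores[k][1])
print("rank of sample covariance:", np.linalg.matrix_rank(S_sample))
print("k chosen by AIC:", best_k_aic, "| k chosen by CAIC:", best_k_caic)
```

With the unsmoothed sample covariance, the trailing eigenvalues are numerically zero, so the log-likelihood terms above become undefined once the retained and discarded eigenvalues include them; the smoothed estimate keeps every term finite.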