Taguchi Y-h, Iwadate Mitsuo, Umeyama Hideaki
Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.
Department of Biological Science, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.
BMC Bioinformatics. 2015 Apr 30;16:139. doi: 10.1186/s12859-015-0574-4.
Feature extraction (FE) is difficult, particularly when features outnumber samples, because small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than in the two-class problem. Developing unsupervised methods that are independent of sample classification would solve many of these problems.
Two principal component analysis (PCA)-based FE methods were tested as sample-classification-independent unsupervised FE methods: variational Bayes PCA (VBPCA), which was extended to perform unsupervised FE, and conventional PCA (CPCA)-based unsupervised FE. Both performed well when applied to simulated data and to a posttraumatic stress disorder (PTSD)-mediated heart disease data set consisting of mRNA/microRNA expression in stressed mouse heart with multiple categorical class observations. A critical set of PTSD miRNAs/mRNAs was identified that showed aberrant expression between treatment and control samples and significant negative correlation with one another. Moreover, the methods demonstrated greater stability and biological feasibility than conventional supervised FE. Based on these results, in silico drug discovery was performed as translational validation of the methods.
Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data and outperformed two conventional supervised FE methods on a real data set. These results suggest that the two methods are comparably effective for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery.
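As an illustration of the general idea behind CPCA-based unsupervised FE, the following minimal Python sketch embeds features (rather than samples) with conventional PCA and keeps the features lying farthest from the origin in the space of the leading principal components. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function names `pca_unsupervised_fe` and `simulate_expression`, the distance-based ranking, the number of components, and the simulated data layout are hypothetical and are not the authors' implementation. VBPCA is not covered.

```python
import numpy as np


def simulate_expression(n_features=100, n_informative=10, seed=0):
    """Hypothetical simulated data: rows are features, columns are 20 samples.

    Samples fall into three classes; only the first `n_informative`
    features carry a class-dependent shift. The class labels are used
    solely to generate the data, never for feature selection.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1, 2], [6, 7, 7])       # three sample classes
    X = rng.normal(size=(n_features, labels.size))  # pure noise baseline
    shift = np.array([0.0, 5.0, -5.0])[labels]      # per-class offset
    X[:n_informative] += shift                      # informative features
    return X


def pca_unsupervised_fe(X, n_components=2, top_k=10):
    """Rank features by their distance from the origin in PC-score space.

    X has shape (features, samples). No class labels are used.
    Returns the indices of the `top_k` most outlying features and the
    feature scores on the leading components.
    """
    # Center each sample (column) across features before the decomposition.
    Xc = X - X.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: U * S gives each feature's PC scores.
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    # Features far from the origin are treated as the extracted features.
    distance = np.sqrt((scores ** 2).sum(axis=1))
    selected = np.argsort(distance)[::-1][:top_k]
    return selected, scores


X = simulate_expression()
selected, scores = pca_unsupervised_fe(X, n_components=2, top_k=10)
print(sorted(selected.tolist()))
```

Because the ranking never consults the class labels, the same call works unchanged regardless of how many sample classes are present, which is the property the abstract emphasizes. In practice, a principled cutoff would be needed in place of the fixed `top_k` used in this sketch.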