Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease.

Author information

Taguchi Y-h, Iwadate Mitsuo, Umeyama Hideaki

Affiliations

Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.

Department of Biological Science, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.

Publication information

BMC Bioinformatics. 2015 Apr 30;16:139. doi: 10.1186/s12859-015-0574-4.

BACKGROUND

Feature extraction (FE) is difficult, particularly when features outnumber samples, because small sample sizes often lead to biased outcomes or overfitting. Multiple sample classes complicate FE further, because the performance evaluation that supervised FE usually relies on is generally harder for multiclass problems than for two-class problems. Unsupervised methods that do not depend on sample classification would solve many of these problems.

RESULTS

Two principal component analysis (PCA)-based FE methods were tested as sample classification-independent unsupervised FE methods: variational Bayes PCA (VBPCA), which was extended to perform unsupervised FE, and conventional PCA (CPCA)-based unsupervised FE. Both performed well when applied to simulated data and to a posttraumatic stress disorder (PTSD)-mediated heart disease data set that had multiple categorical class observations in mRNA/microRNA (miRNA) expression of stressed mouse heart. A critical set of PTSD miRNAs/mRNAs was identified that showed aberrant expression between treatment and control samples and significant negative correlation with one another. The methods also demonstrated greater stability and biological feasibility than conventional supervised FE. Based on these results, in silico drug discovery was performed as translational validation of the methods.
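
The general idea of CPCA-based unsupervised FE on simulated multiclass data can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so everything below is an assumption made for illustration: features (rather than samples) are embedded by PCA, features are ranked by their squared standardized scores on the leading component, and the simulated data set (1000 features, 3 classes of 4 samples, 20 informative features) is hypothetical. The function names `pca_unsupervised_fe` and `simulate` are invented here. VBPCA is not covered. Class labels are used only to generate the data, never to select features.

```python
import numpy as np


def pca_unsupervised_fe(X, n_select, n_components=1):
    """Rank features by how extreme their PCA scores are (illustrative only).

    X has shape (n_features, n_samples). Class labels are never used.
    """
    # Center each sample (column) so that features are the embedded points.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: rows of U * S are the feature scores on each component.
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    # Standardize the scores and sum squares over the chosen components.
    z = scores / scores.std(axis=0, keepdims=True)
    outlier = (z ** 2).sum(axis=1)
    # Indices of the n_select most extreme features, most extreme first.
    return np.argsort(outlier)[::-1][:n_select]


def simulate(n_features=1000, n_informative=20, n_per_class=4,
             shift=4.0, seed=0):
    """Hypothetical three-class data: the first n_informative rows carry signal."""
    rng = np.random.default_rng(seed)
    class_effect = np.repeat([-1.0, 0.0, 1.0], n_per_class)
    X = rng.normal(size=(n_features, class_effect.size))
    # Each informative feature shifts up or down with the class, at random.
    signs = rng.choice([-1.0, 1.0], size=(n_informative, 1))
    X[:n_informative] += shift * signs * class_effect
    return X


X = simulate()
selected = pca_unsupervised_fe(X, n_select=20)
n_hits = int(np.sum(selected < 20))  # informative features occupy rows 0..19
print(f"{n_hits} of {selected.size} selected features are informative")
```

Because the ranking never sees the class labels, the same call works unchanged for any number of classes, which is the property the abstract emphasizes. A real analysis would also need a principled cutoff for how many features to keep, which this sketch leaves as the user-supplied `n_select`.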

CONCLUSIONS

Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data and outperformed two conventional supervised FE methods on a real data set. These results suggest that the two methods are equivalent for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery.
