College of Liberal Arts, Sangmyung University, 31 Sangmyungdae-gil, Cheonan, Chungnam 330-729, Republic of Korea; Department of Statistics, Seoul National University, 1 Gwanak-ro, Seoul 151-747, Republic of Korea; Department of Mathematics and Statistics, Boston University, 111 Cummington Mall, Boston, MA 02215, USA.
Department of Statistics, Seoul National University, 1 Gwanak-ro, Seoul 151-747, Republic of Korea.
Artif Intell Med. 2014 Sep;62(1):23-31. doi: 10.1016/j.artmed.2014.06.003. Epub 2014 Jun 21.
Although numerous studies of cancer survival have been published, improving the prediction accuracy of survival classes remains a challenge. Integrating different data sets, such as microRNA (miRNA) and mRNA expression, might increase the accuracy of survival class prediction. We therefore propose a machine learning (ML) approach for integrating different data sets, and we developed a novel method based on feature selection with the Cox proportional hazards regression model (FSCOX) to improve the prediction of cancer survival time.
FSCOX retains intermediate survival information, which is usually discarded when survival is split into 2 groups (short- and long-term), and allows us to perform survival analysis. We used an ML-based protocol for feature selection, integrating information from miRNA and mRNA expression profiles at the feature level. To predict survival phenotypes, we used the following classifiers: first, the existing ML methods support vector machine (SVM) and random forest (RF); second, a new median-based classifier using FSCOX (FSCOX_median); and third, an SVM classifier using FSCOX (FSCOX_SVM). We compared these methods using 3 types of cancer tissue data sets: (i) miRNA expression, (ii) mRNA expression, and (iii) combined miRNA and mRNA expression. The last data set included features selected either from the combined miRNA/mRNA profile or independently from the miRNA and mRNA profiles (independent feature selection, IFS).
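The difference between selecting features from the combined miRNA/mRNA profile and selecting them independently (IFS) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the synthetic data, the number of features kept, and in particular the ranking score, which is a simple standardized mean difference standing in for whichever selection criterion the study actually used.

```python
import numpy as np

def top_k_features(X, y, k):
    """Return indices of the k columns of X that best separate the two classes.

    The score is a standardized difference of class means. It is a stand-in
    for illustration only, not the selection criterion used in the study.
    """
    short_term, long_term = X[y == 0], X[y == 1]
    pooled_sd = np.sqrt(
        (short_term.var(axis=0, ddof=1) + long_term.var(axis=0, ddof=1)) / 2
    ) + 1e-12  # small constant avoids division by zero
    score = np.abs(short_term.mean(axis=0) - long_term.mean(axis=0)) / pooled_sd
    return np.argsort(score)[::-1][:k]

def joint_selection(mirna, mrna, y, k):
    """Select k features from the concatenated miRNA/mRNA matrix."""
    combined = np.hstack([mirna, mrna])
    return combined[:, top_k_features(combined, y, k)]

def independent_selection(mirna, mrna, y, k_mirna, k_mrna):
    """Select features from each data type separately, then concatenate (IFS)."""
    mirna_idx = top_k_features(mirna, y, k_mirna)
    mrna_idx = top_k_features(mrna, y, k_mrna)
    features = np.hstack([mirna[:, mirna_idx], mrna[:, mrna_idx]])
    return features, mirna_idx, mrna_idx

# Synthetic example: 22 short-term (0) and 22 long-term (1) survivors.
rng = np.random.default_rng(0)
y = np.array([0] * 22 + [1] * 22)
mirna = rng.normal(size=(44, 30))    # hypothetical miRNA expression matrix
mrna = rng.normal(size=(44, 200))    # hypothetical mRNA expression matrix
mirna[y == 1, 3] += 3.0              # plant one informative miRNA
mrna[y == 1, 7] += 3.0               # plant one informative mRNA

features, mirna_idx, mrna_idx = independent_selection(mirna, mrna, y, 5, 5)
print(features.shape)
```

With independent selection, each data type is guaranteed a fixed share of the final feature set, whereas joint selection lets the larger or higher-variance data type dominate the ranking. The resulting feature matrix would then be passed to a classifier such as SVM or RF.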
In the ovarian data set, which was balanced with 22 short-term and 22 long-term survivors, the accuracy of survival classification using the combined miRNA/mRNA profiles with IFS was 75% using RF, 86.36% using SVM, 84.09% using FSCOX_median, and 88.64% using FSCOX_SVM. These accuracies are higher than those obtained using miRNA alone (70.45%, RF; 75%, SVM; 75%, FSCOX_median; and 75%, FSCOX_SVM) or mRNA alone (65.91%, RF; 63.64%, SVM; 72.73%, FSCOX_median; and 70.45%, FSCOX_SVM). Similarly, in the glioblastoma multiforme data, the accuracy of miRNA/mRNA using IFS was 75.51% (RF), 87.76% (SVM), 85.71% (FSCOX_median), and 85.71% (FSCOX_SVM). These results are higher than those obtained using miRNA expression or mRNA expression alone. In addition, we predicted 16 hsa-miR-23b and hsa-miR-27b target genes in the ovarian cancer data sets, obtained by SVM-based feature selection through integration of sequence information and gene expression profiles.
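As a quick sanity check on the ovarian figures, the reported percentages can be converted back into counts of correctly classified patients. This sketch assumes that each accuracy was computed over all 44 samples, which the abstract does not state explicitly; the counts below are therefore inferred, not reported.

```python
n_samples = 44  # 22 short-term + 22 long-term survivors

# Reported accuracies (%) for combined miRNA/mRNA profiles with IFS.
reported = {
    "RF": 75.00,
    "SVM": 86.36,
    "FSCOX_median": 84.09,
    "FSCOX_SVM": 88.64,
}

def implied_correct(accuracy_pct, n):
    """Nearest whole number of correct predictions for a given accuracy."""
    return round(accuracy_pct / 100 * n)

counts = {name: implied_correct(pct, n_samples) for name, pct in reported.items()}

for name, correct in counts.items():
    print(f"{name}: {correct}/{n_samples} = {100 * correct / n_samples:.2f}%")
```

Under that assumption, the gap between the best method (FSCOX_SVM) and SVM corresponds to a single additional correctly classified patient, which is worth keeping in mind when comparing the methods on a data set of this size.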
Among the approaches used, the integrated miRNA and mRNA data set yielded better results than the individual data sets. The best performance was achieved using the FSCOX_SVM method with independent feature selection, which uses intermediate survival information between short-term and long-term survival time and the combination of the 2 different data sets. The results obtained using the combined data set suggest that there are some strong interactions between miRNA and mRNA features that are not detectable in the individual analyses.