Zuse Institute Berlin, Berlin, Germany.
Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.
PLoS One. 2019 Jan 31;14(1):e0204186. doi: 10.1371/journal.pone.0204186. eCollection 2019.
Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility has been criticized. The performance of feature selection algorithms has been shown to depend on the datasets, the underlying networks, and the evaluation metrics. One cause is the curse of dimensionality, which makes it hard to select features that generalize well to independent data. Even integrating biological networks does not mitigate this issue, because the networks are large and many of their components are not relevant to the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensionality creates even greater challenges. We proposed a phenotype relevant network-based feature selection (PRNFS) framework and demonstrated its advantages in lung cancer prognosis prediction. We constructed cancer prognosis relevant networks based on epithelial-mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. Using less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved strong prediction performance (average AUC values >0.8), highly significant sample stratification, and meaningful biological interpretations. In addition to finding EMT signatures at different omics data levels, we combined these single-omics signatures into multi-omics signatures, which significantly improved sample stratification. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets, and significant sample stratification was obtained.
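The general idea of restricting feature selection to a phenotype-relevant gene set before ranking can be illustrated with a minimal sketch. This is not the PRNFS algorithm itself, whose details are not given in the abstract; it assumes a simple filter that keeps only genes in a relevant set and ranks them by single-feature AUC. The function names (`auc`, `select_features`), the toy expression values, the labels, and the choice of genes are all hypothetical and for illustration only.

```python
def auc(scores, labels):
    """Rank-based AUC: the fraction of (positive, negative) sample pairs in
    which the positive sample has the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_features(data, labels, relevant_genes, k):
    """Keep only genes in the phenotype-relevant set (an assumed filter),
    rank them by single-feature AUC, and return the top k gene names."""
    ranked = []
    for gene, values in data.items():
        if gene not in relevant_genes:
            continue  # discard features outside the relevant network
        a = auc(values, labels)
        # A gene can be informative in either direction (up or down).
        ranked.append((max(a, 1 - a), gene))
    ranked.sort(reverse=True)
    return [gene for _, gene in ranked[:k]]


# Toy data (hypothetical values): one row of measurements per gene,
# one column per sample; labels mark poor (1) vs. good (0) prognosis.
data = {
    "VIM":   [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "CDH1":  [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],
    "SNAI1": [0.5, 0.4, 0.9, 0.6, 0.1, 0.5],
    "GAPDH": [0.9, 0.9, 0.8, 0.1, 0.1, 0.2],  # not in the relevant set
}
labels = [1, 1, 1, 0, 0, 0]
emt_genes = {"VIM", "CDH1", "SNAI1"}  # assumed relevant gene set

selected = select_features(data, labels, emt_genes, k=2)
print(selected)
```

In this toy setup, `GAPDH` separates the two groups perfectly but is never considered, because it lies outside the relevant set. That is the point of the filter: the search space shrinks before any ranking happens, which is one way to limit the effect of high dimensionality.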