Department of Computer Science, The University of Texas at San Antonio, One UTSA Circle, San Antonio, 78249, TX, USA.
Department of Computer & Electrical Engineering/Computer Science, California State University, Bakersfield, 9001 Stockdale Highway, Bakersfield, 93311, CA, USA.
BMC Bioinformatics. 2020 Sep 30;21(Suppl 14):359. doi: 10.1186/s12859-020-03692-2.
The abundance of molecular profiling data from breast cancer tissues has spurred active research on molecular marker-based early diagnosis of metastasis. Recently, there has been surging interest in combining gene expression with gene networks, such as protein-protein interaction (PPI) networks, gene co-expression (CE) networks, and pathway information, to identify robust and accurate biomarkers for metastasis prediction, reflecting the common belief that cancer is a systems biology disease. However, controversy exists in the literature over whether network markers are indeed better features than genes alone for predicting, as well as understanding, metastasis. We believe many of the existing results may have been biased by overly complicated prediction algorithms, unfair evaluation, and a lack of rigorous statistics. In this study, we propose a simple approach that uses network edges as features, built separately from two types of networks, and compare their prediction power using three classification algorithms and a rigorous statistical procedure on one of the largest datasets available. To detect biomarkers that are significant for prediction and to compare the robustness of different feature types, we propose a novel, unbiased procedure for measuring feature importance that eliminates potential bias from factors such as differences in sample size, number of features, and class distribution.
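To make the idea of using network edges as features concrete, the following is a minimal sketch and not the procedure used in the study. The abstract does not state how an edge value is computed, so the sketch assumes, purely for illustration, that each edge feature is the product of the z-scored expression values of its two endpoint genes. The function name `edge_features` and its parameters are hypothetical.

```python
import numpy as np


def edge_features(expr, gene_names, edges):
    """Build a samples-by-edges feature matrix from gene expression.

    expr       : array-like, shape (n_samples, n_genes)
    gene_names : list of gene identifiers, one per column of expr
    edges      : list of (gene_a, gene_b) pairs taken from a network

    ASSUMPTION: the edge value is the product of the two endpoint
    genes' z-scores. The source abstract does not define this.
    """
    expr = np.asarray(expr, dtype=float)
    index = {g: i for i, g in enumerate(gene_names)}

    # z-score each gene across samples; constant genes map to zero
    mean = expr.mean(axis=0)
    std = expr.std(axis=0)
    std[std == 0] = 1.0
    z = (expr - mean) / std

    # keep only edges whose two genes are both measured
    kept = [(a, b) for a, b in edges if a in index and b in index]
    cols = [z[:, index[a]] * z[:, index[b]] for a, b in kept]
    feats = np.column_stack(cols) if cols else np.empty((expr.shape[0], 0))
    return feats, kept


# Toy usage: 3 samples, 3 genes, and a network with one unmeasured gene "X"
toy_expr = [[1.0, 2.0, 3.0],
            [3.0, 2.0, 1.0],
            [5.0, 2.0, 2.0]]
toy_feats, toy_edges = edge_features(
    toy_expr, ["A", "B", "C"], [("A", "C"), ("A", "B"), ("A", "X")]
)
```

The resulting matrix can be passed to any standard classifier in place of the gene-level matrix, so gene and edge feature types can be compared under the same evaluation protocol.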
Experimental results reveal that edge-based feature types consistently outperformed the gene-based feature type in random forest and logistic regression models under all performance evaluation metrics, while the prediction accuracy of the edge-based support vector machine (SVM) model was poorer, owing to the larger number of edge features compared with gene features and the lack of feature selection in the SVM model. Experimental results also show that edge features are much more robust than gene features, and that the top biomarkers from edge feature types are more significantly enriched in biological processes well known to be related to breast cancer metastasis.
Overall, this study validates the utility of edge features as biomarkers and also highlights the importance of carefully designed experimental procedures for achieving statistically reliable comparison results.