INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, 1000-029, Lisbon, Portugal.
NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), NOVA School of Science and Technology, 2829-516, Caparica, Portugal.
BMC Bioinformatics. 2023 Jan 16;24(1):17. doi: 10.1186/s12859-022-05104-z.
Colorectal cancer (CRC) is the third most common cancer and the second deadliest worldwide. It is a highly heterogeneous disease that can develop via distinct pathways, and metastasis is the primary cause of death. Therefore, it is crucial to understand the molecular mechanisms underlying metastasis. RNA-sequencing is an essential tool for studying the transcriptional landscape. However, the high dimensionality of gene expression data makes selecting novel metastatic biomarkers difficult. To distinguish early-stage CRC patients at risk of developing metastasis from those who are not, three types of binary classification approaches were used: (1) classification methods (decision trees, linear and radial kernel support vector machines, logistic regression, and random forest) using differentially expressed genes (DEGs) as input features; (2) regularized logistic regression based on the Elastic Net penalty and on the proposed iTwiner, a network-based regularizer accounting for gene correlation information; and (3) classification methods based on the genes pre-selected using regularized logistic regression. Classifiers using the DEGs as features showed similar results, with random forest showing the highest accuracy. Using regularized logistic regression on the full dataset yielded no improvement in accuracy. Further classification using the genes pre-selected with the different penalties, instead of the DEGs, significantly improved the accuracy of the binary classifiers. Moreover, the use of network-based correlation information (iTwiner) for gene selection produced the best classification results and identified more stable and robust gene sets. Some of the selected genes are known to be tumor suppressor genes (OPCML-IT2), to be related to resistance to cancer therapies (RAC1P3), or to be involved in several cancer processes such as genome stability (XRCC6P2), tumor growth and metastasis (MIR602), and regulation of gene transcription (NME2P2).
We show that classifying CRC patients using features pre-selected by regularized logistic regression is a valuable alternative to using DEGs, significantly increasing the models' predictive performance. Moreover, correlation-based penalization for biomarker selection stands as a promising strategy for predicting patient groups from RNA-seq data.
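As an illustration of the pre-selection idea in approaches (2) and (3), the following is a minimal sketch of Elastic Net regularized logistic regression fitted by proximal gradient descent, with the genes that keep non-zero coefficients taken as the pre-selected set. It is not the authors' implementation and does not include the iTwiner penalty; the function names, the penalty parametrization (`lam`, `alpha`), the step size, and the toy data are all assumptions made for this example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of t * |v|; this is what sets coefficients exactly to zero.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_elastic_net_logreg(X, y, lam=0.1, alpha=0.5, lr=0.1, n_iter=500):
    """Minimise mean log-loss + lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    X is a list of samples (rows) by genes (columns); y holds 0/1 labels.
    Hypothetical helper written for illustration only.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        for j in range(p):
            # Gradient step on the smooth part (log-loss + ridge term) ...
            g = grad_w[j] / n + lam * (1.0 - alpha) * w[j]
            # ... followed by soft-thresholding for the L1 term.
            w[j] = soft_threshold(w[j] - lr * g, lr * lam * alpha)
        b -= lr * grad_b / n  # the intercept is not penalised
    return w, b


def select_features(weights, tol=1e-12):
    # Indices of genes whose coefficient survived the penalty.
    return [j for j, wj in enumerate(weights) if abs(wj) > tol]


if __name__ == "__main__":
    # Toy data (assumed): column 0 separates the classes, column 1 is uninformative.
    X = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    y = [1, 0, 1, 0]
    w, b = fit_elastic_net_logreg(X, y)
    selected = select_features(w)
    # Approach (3) would train any downstream classifier on these columns only.
    X_selected = [[row[j] for j in selected] for row in X]
    print(selected, X_selected)
```

In this sketch, `alpha` balances the L1 and L2 terms and `lam` sets the overall penalty strength; a larger `lam` leaves fewer genes in the selected set. In practice both would be tuned by cross-validation, and an established library would normally replace the hand-written optimizer.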