Yu Xin, Yang Qian, Wang Dong, Li Zhaoyang, Chen Nianhang, Kong De-Xin
State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, Hubei, China.
Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China.
PeerJ. 2021 Feb 16;9:e10884. doi: 10.7717/peerj.10884. eCollection 2021.
Building on the knowledge that methyltransferases and demethylases can modify adjacent cytosine-phosphate-guanine (CpG) sites on the same DNA strand, we found that combining multiple CpGs into a single block may improve cancer diagnosis. However, survival prediction remains a challenge. In this study, we developed a pipeline named "stacked ensemble of machine learning models for methylation-correlated blocks" (EnMCB), which combines Cox regression, support vector regression (SVR), and elastic-net models to construct signatures based on DNA methylation-correlated blocks for lung adenocarcinoma (LUAD) survival prediction. We used methylation profiles from The Cancer Genome Atlas (TCGA) as the training set, and profiles from the Gene Expression Omnibus (GEO) as the validation and testing sets. First, we partitioned the genome into blocks of tightly co-methylated CpG sites, which we termed methylation-correlated blocks (MCBs). After partitioning and feature selection, the individual models differed in their ability to predict patient survival, so we combined them into a single stacking ensemble model. The stacking ensemble model based on the top-ranked block achieved an area under the receiver operating characteristic curve (AUC) of 0.622 in the TCGA training set, 0.773 in the validation set, and 0.698 in the testing set. When patients were stratified by clinicopathological risk factors, the risk score predicted from the top-ranked MCB was an independent prognostic factor. Our results show that the pipeline is a reliable tool that may facilitate MCB selection and survival prediction.
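The partitioning step described above can be illustrated with a minimal sketch. It is not the EnMCB implementation: it assumes that CpG sites are already sorted by genomic position, that a block grows while the Pearson correlation between neighbouring sites stays at or above a cutoff, and that the cutoff is 0.8 (a value chosen here for illustration, not taken from the abstract). The names `pearson`, `partition_blocks`, and `beta` are hypothetical, and the methylation values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def partition_blocks(beta, threshold=0.8):
    """Group position-ordered CpG sites into blocks of co-methylated neighbours.

    beta: list of per-CpG methylation vectors (one value per sample),
          ordered by genomic position (assumption of this sketch).
    threshold: illustrative correlation cutoff, not a value from the source.
    Returns a list of blocks, each a list of CpG indices.
    """
    if not beta:
        return []
    blocks = [[0]]
    for i in range(1, len(beta)):
        # Extend the current block only if this site tracks its left neighbour.
        if pearson(beta[i - 1], beta[i]) >= threshold:
            blocks[-1].append(i)
        else:
            blocks.append([i])
    return blocks


# Toy data: 4 CpG sites measured in 5 samples (invented values).
beta = [
    [0.10, 0.20, 0.30, 0.40, 0.50],
    [0.12, 0.22, 0.31, 0.41, 0.52],
    [0.90, 0.10, 0.80, 0.20, 0.70],
    [0.85, 0.15, 0.75, 0.25, 0.65],
]
blocks = partition_blocks(beta)
print(blocks)
```

In this toy input, sites 0 and 1 rise together and sites 2 and 3 oscillate together, so the sketch yields two blocks. A fuller version would presumably also cap the genomic distance between neighbouring sites and summarize each block before feeding it to the Cox, SVR, and elastic-net models, which is beyond this sketch.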