Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India.
Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
PLoS One. 2019 Sep 6;14(9):e0221476. doi: 10.1371/journal.pone.0221476. eCollection 2019.
Liver Hepatocellular Carcinoma (LIHC) is one of the major cancers worldwide, responsible for millions of premature deaths every year. Predicting the clinical stage is vital for choosing an optimal therapeutic strategy and for estimating prognosis in cancer patients. However, to date, no method has been developed for predicting the stage of LIHC from the genomic profile of samples.
The Cancer Genome Atlas (TCGA) dataset of 173 early-stage (stage I), 177 late-stage (stage II, stage III and stage IV) and 50 adjacent normal tissue samples, covering 60,483 RNA transcripts and 485,577 methylation CpG sites, was extensively analyzed to identify key transcriptomic expression and methylation-based features using different feature selection techniques. Classification models were then developed on the selected key features, using different machine learning algorithms, to categorize the different classes of samples.
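The feature-selection step described above can be illustrated with a minimal sketch. The abstract does not name the selection techniques used, so ranking features by Welch's t-statistic is an assumption made here for illustration only, as are the function names `welch_t` and `top_k_features` and the toy values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for one feature measured in two sample groups."""
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def top_k_features(group_a, group_b, k):
    """Return the indices of the k features with the largest |t|.

    group_a, group_b: lists of samples; each sample is a list of
    feature values (e.g. expression or methylation levels).
    """
    n_features = len(group_a[0])
    scores = []
    for j in range(n_features):
        col_a = [sample[j] for sample in group_a]
        col_b = [sample[j] for sample in group_b]
        scores.append((abs(welch_t(col_a, col_b)), j))
    scores.sort(reverse=True)  # strongest separation first
    return [j for _, j in scores[:k]]


# Toy values (hypothetical, not TCGA data): 3 samples per group, 3 features.
early = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [0.9, 4.9, 0.5]]
late = [[3.0, 5.0, 0.4], [3.1, 5.2, 0.8], [2.9, 4.8, 0.6]]

selected = top_k_features(early, late, k=1)
selected_two = top_k_features(early, late, k=2)
```

In this toy input only the first feature differs clearly between the groups, so it ranks first. A real analysis over tens of thousands of transcripts would also need multiple-testing control, which this sketch omits.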
In the current study, in silico models have been developed for classifying LIHC patients into early vs. late stage and samples into cancerous vs. normal using RNA expression and DNA methylation data. TCGA datasets were extensively analyzed to identify differentially expressed RNA transcripts and methylated CpG sites that can discriminate early vs. late stages and cancer vs. normal samples of LIHC with high precision. A Naive Bayes model developed using 51 features, combining 21 CpG methylation sites and 30 RNA transcripts, achieved a maximum MCC (Matthews correlation coefficient) of 0.58 with an accuracy of 78.87% on the validation dataset in discriminating early from late stage. Additionally, the prediction models developed based on 5 RNA transcripts and 5 CpG sites classify LIHC and normal samples with an accuracy of 96-98% and an AUC (Area Under the Receiver Operating Characteristic curve) of 0.99. Multiclass models were also developed for classifying samples into normal, early-stage and late-stage cancer, and achieved an accuracy of 76.54% and an AUC of 0.86.
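For readers unfamiliar with the reported metrics, the following minimal sketch shows how accuracy and MCC are computed from a binary confusion matrix. The labels are toy values chosen for illustration, not the study's predictions, and the function names are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels (hypothetical): 1 = late stage, 0 = early stage.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
score = mcc(*counts)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it remains informative when the classes are imbalanced, as with the 50 normal samples set against the cancer samples here.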
Our study shows that predicting the stage of LIHC samples with high accuracy from genomic and epigenomic profiles is a challenging task compared with classifying cancerous and normal samples. The comprehensive analysis, differentially expressed RNA transcripts, methylated CpG sites in LIHC samples and prediction models are available from CancerLSP (http://webs.iiitd.edu.in/raghava/cancerlsp/).