Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA.
Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA.
Bioinformatics. 2021 Apr 20;37(3):396-403. doi: 10.1093/bioinformatics/btaa717.
Essential genes are required for reproductive success at either the cellular or the organismal level. Identifying essential genes is important for understanding core biological processes and for finding effective therapeutic drug targets. However, experimental identification of essential genes is costly, time-consuming and labor-intensive. Although several machine learning models have been developed to predict essential genes, these models are not readily applicable to lncRNAs. Moreover, the currently available models cannot be used to predict essential genes in a specific cancer type.
In this study, we developed a new machine learning approach, XGEP (eXpression-based Gene Essentiality Prediction), to predict essential genes and candidate lncRNAs in cancer cells. The novelty of XGEP lies in its use of relevant features derived from the TCGA transcriptome dataset through collaborative embedding. When evaluated on the pan-cancer dataset, XGEP accurately predicted human essential genes and achieved significantly higher performance than previous models. Notably, several candidate lncRNAs selected by XGEP have been reported to promote cell proliferation and inhibit cell apoptosis. Moreover, XGEP also demonstrated superior performance in identifying essential genes on cancer-type-specific datasets. The comprehensive lists of candidate essential genes in specific cancer types may be used to guide experimental characterization and facilitate the discovery of drug targets for cancer therapy.
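The general idea of learning gene features from an expression matrix and then classifying genes can be illustrated with a small sketch. This is not the XGEP implementation (the authors' code is in the repository cited in this abstract). It assumes a plain matrix factorization as a stand-in for collaborative embedding, a logistic regression as a stand-in for the classifier, and synthetic data in place of TCGA. All function names and parameters here are hypothetical.

```python
import math
import random

def make_synthetic_expression(n_genes=40, n_samples=15, dim=2, seed=1):
    """Build a toy genes x samples matrix from hidden factors (NOT TCGA data).
    A gene is labeled 'essential' (1) when its first hidden factor is positive."""
    rng = random.Random(seed)
    true_g = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_genes)]
    true_s = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]
    expr = [[sum(a * b for a, b in zip(g, s)) + rng.gauss(0, 0.1)
             for s in true_s] for g in true_g]
    labels = [1 if g[0] > 0 else 0 for g in true_g]
    return expr, labels

def factorize(expr, dim=2, epochs=300, lr=0.02, reg=0.01, seed=0):
    """Approximate expr[g][s] by dot(G[g], S[s]) using stochastic gradient descent.
    Returns gene embeddings G, sample embeddings S, and the per-epoch mean squared error."""
    rng = random.Random(seed)
    n_genes, n_samples = len(expr), len(expr[0])
    G = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_genes)]
    S = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_samples)]
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for g in range(n_genes):
            for s in range(n_samples):
                err = expr[g][s] - sum(a * b for a, b in zip(G[g], S[s]))
                sq_err += err * err
                for k in range(dim):
                    gk, sk = G[g][k], S[s][k]  # use the pre-update values for both steps
                    G[g][k] += lr * (err * sk - reg * gk)
                    S[s][k] += lr * (err * gk - reg * sk)
        history.append(sq_err / (n_genes * n_samples))
    return G, S, history

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that a gene with embedding x is essential."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)

def train_logistic(X, y, epochs=500, lr=0.1):
    """Fit a logistic regression on gene embeddings with per-example gradient steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = yi - predict_proba(w, b, xi)
            w = [wk + lr * grad * xk for wk, xk in zip(w, xi)]
            b += lr * grad
    return w, b

if __name__ == "__main__":
    expr, labels = make_synthetic_expression()
    G, _, history = factorize(expr)
    w, b = train_logistic(G, labels)
    ranked = sorted(range(len(G)), key=lambda g: -predict_proba(w, b, G[g]))
    print("MSE first/last epoch:", history[0], history[-1])
    print("Top-ranked toy genes:", ranked[:5])
```

In this toy setting the gene embeddings play the role of expression-derived features, and the ranked probabilities correspond to a list of candidate essential genes. A real analysis would need held-out evaluation and experimentally derived essentiality labels, neither of which this sketch provides.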
The source code and datasets used in this study are freely available at https://github.com/BioDataLearning/XGEP.
Supplementary data are available at Bioinformatics online.