Li Peng, Guo Maozu, Sun Bo
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, P. R. China.
School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, P. R. China.
J Bioinform Comput Biol. 2019 Dec;17(6):1950038. doi: 10.1142/S0219720019500380.
The identification of cancer-related genes is a major research goal, with implications for determining the pathogenesis of cancer and identifying biomarkers for early diagnosis and treatment. In this study, we propose a network model-based method for mining cancer-related genes that integrates multi-omics data, including gene expression, DNA copy number variation, DNA methylation, transcription factor, miRNA, and lncRNA data. First, multi-omics data are integrated using a random forest-based feature selection method to identify key regulatory factors that affect gene expression, and genome-wide regulatory networks are then constructed. Next, by comparing the regulatory networks of key candidate genes in variant samples and non-variant samples, a differential expression regulatory network is generated. The differential network contains the set of abnormal regulatory genes for the key candidate genes. Then, with functional similarity introduced as a distance metric for gene sets, a density-based clustering method is used to mine gene modules related to cancer. We applied this method to lung squamous cell carcinoma (LUSC) and mined cancer-related gene modules comprising 20 genes. GO function and KEGG pathway analyses indicated that the modules were closely related to cancer. Survival analysis verified that the mined gene modules can effectively distinguish between high- and low-risk groups. Overall, these results suggest that the proposed method can be used to identify cancer-related gene modules, providing a basis for the development of biomarkers for diagnosis and treatment.
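The clustering step described in the abstract (density-based clustering of genes under a functional-similarity distance) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the clustering algorithm, or any parameters, so everything here is an assumption for illustration: Jaccard distance over annotation-term sets stands in for functional similarity, a plain DBSCAN stands in for the density-based method, and the gene names (`G1` to `G5`), term labels, `eps`, and `min_pts` are hypothetical toy values, not data or settings from the study.

```python
from typing import Callable, Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Stand-in functional distance: 1 - |a & b| / |a | b| (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(items: List[str],
           dist: Callable[[str, str], float],
           eps: float,
           min_pts: int) -> Dict[str, int]:
    """Label each item with a cluster id (0, 1, ...) or -1 for noise."""
    labels: Dict[str, int] = {}

    def neighbors(p: str) -> List[str]:
        # Neighborhood includes p itself, as in the usual DBSCAN definition.
        return [q for q in items if dist(p, q) <= eps]

    cluster = -1
    for p in items:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # noise reached from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:
                queue.extend(q_nbrs)  # q is a core point: keep expanding
    return labels


# Hypothetical toy annotations (not from the paper).
annotations: Dict[str, Set[str]] = {
    "G1": {"t1", "t2", "t3"},
    "G2": {"t1", "t2", "t4"},
    "G3": {"t1", "t2", "t3", "t4"},
    "G4": {"t8", "t9"},
    "G5": {"t7"},
}

labels = dbscan(
    list(annotations),
    lambda p, q: jaccard_distance(annotations[p], annotations[q]),
    eps=0.5,
    min_pts=3,
)
print(labels)
```

In this toy setting, genes that share most of their annotation terms fall into one dense module, while genes with no shared terms are left as noise. A real analysis would presumably replace the Jaccard stand-in with a GO-based semantic similarity and tune the density parameters on the data.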