Jiang Qin, Jin Min
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China.
Front Genet. 2021 Feb 26;12:629946. doi: 10.3389/fgene.2021.629946. eCollection 2021.
Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large volume of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that requires no prior information to identify mutated genes associated with breast cancer. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, and the mutated gene set is mapped onto this matrix to construct a co-expression profile. For feature selection, we propose a staged algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. For evaluation, we propose a weighted metric that modifies the traditional accuracy to address the sample imbalance problem. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set in which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also apply the optimal model to the gene expression data alone and to the gold standard PAM50. The results show that the integrative network outperforms both the gene expression data and PAM50 on average for most metrics, which indicates that the proposed method, by integrating multiple data sources, is effective and can discover mutated genes associated with breast cancer.
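The first two steps described above, filtering genes by mutation frequency and computing fold change, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample-by-gene matrix layout, the inclusive `>=` comparison, the pseudocount, and the toy gene names and values are all hypothetical, and the abstract does not give the threshold actually used.

```python
import numpy as np


def mutation_frequency(mutation_matrix):
    """Fraction of samples in which each gene is mutated.

    Assumes rows are samples and columns are genes, with nonzero
    entries marking a somatic mutation (layout is an assumption).
    """
    mutated = np.asarray(mutation_matrix, dtype=bool)
    return mutated.mean(axis=0)


def filter_mutated_genes(mutation_matrix, gene_names, threshold):
    """Keep genes whose mutation frequency reaches the threshold.

    The inclusive comparison (>=) is an assumption for illustration.
    """
    freq = mutation_frequency(mutation_matrix)
    return [gene for gene, f in zip(gene_names, freq) if f >= threshold]


def log2_fold_change(expr_tumor, expr_normal, pseudocount=1.0):
    """Per-gene log2 ratio of mean tumor to mean normal expression.

    The pseudocount avoids division by zero; its value is an assumption.
    """
    tumor_mean = np.asarray(expr_tumor, dtype=float).mean(axis=0)
    normal_mean = np.asarray(expr_normal, dtype=float).mean(axis=0)
    return np.log2((tumor_mean + pseudocount) / (normal_mean + pseudocount))


if __name__ == "__main__":
    # Toy data: 4 samples x 3 hypothetical genes.
    toy_mutations = [
        [1, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
        [1, 0, 0],
    ]
    genes = ["GENE_A", "GENE_B", "GENE_C"]
    print(filter_mutated_genes(toy_mutations, genes, threshold=0.25))
    # → ['GENE_A', 'GENE_B']
```

In a full pipeline, the retained genes would then be mapped onto the expression matrix before the false discovery rate, mutual information, and gradient boosting stages. Those stages are omitted here because the abstract does not specify their parameters.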