Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran.
Department of Computer Engineering, Simon Fraser University, Burnaby, BC, 1S6, Canada.
BMC Bioinformatics. 2022 Jul 25;23(1):298. doi: 10.1186/s12859-022-04840-6.
The advent of high-throughput sequencing has enabled researchers to systematically evaluate genetic variation in cancer and to identify many cancer-associated genes. Although cancers arising in the same tissue are usually grouped together, their mutational profiles differ considerably. Hence, there is no definitive treatment for most cancer types. This underscores the importance of developing new pipelines that accurately identify cancer-associated genes and re-classify patients by similar mutational profiles. Such classification may help discover subtypes of cancer patients who might benefit from specific treatments.
In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in many samples and uses them to identify cancer subtypes. We apply our pipeline to 12,270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types. As a result, we identify 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates that these subtypes have distinguishable properties, including unique cancer-related signaling pathways.
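The two steps named above (selecting genes mutated in many samples, then grouping samples by their mutational profiles) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or any thresholds, so everything here is an assumption made for illustration: the function names, the `min_fraction` and `min_similarity` parameters, and the use of single-linkage grouping on Jaccard similarity as a stand-in for the authors' clustering method. The actual pipeline is in the repository linked in the abstract.

```python
from collections import Counter
from itertools import combinations


def frequent_genes(sample_genes, min_fraction):
    """Return genes mutated in at least min_fraction of the samples.

    sample_genes maps a sample ID to the set of genes mutated in it.
    (Hypothetical helper; the threshold is an illustrative assumption.)
    """
    counts = Counter(g for genes in sample_genes.values() for g in set(genes))
    threshold = min_fraction * len(sample_genes)
    return {g for g, c in counts.items() if c >= threshold}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_samples(sample_genes, min_fraction=0.5, min_similarity=0.5):
    """Group samples whose filtered profiles are similar (single linkage).

    This is a stand-in technique, not the method used in the paper.
    """
    keep = frequent_genes(sample_genes, min_fraction)
    # Restrict every profile to the frequently mutated genes.
    profiles = {s: set(g) & keep for s, g in sample_genes.items()}

    # Union-find over samples: link any pair that is similar enough.
    parent = {s: s for s in profiles}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(profiles, 2):
        if jaccard(profiles[a], profiles[b]) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for s in profiles:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy = {
        "S1": {"TP53", "KRAS"},
        "S2": {"TP53", "KRAS", "TTN"},
        "S3": {"BRAF", "NRAS"},
        "S4": {"BRAF", "NRAS", "MUC16"},
    }
    print(group_samples(toy))
```

Single linkage keeps the sketch short and deterministic. A real analysis on thousands of samples would more likely use a matrix-based clustering library and a principled choice of the number of clusters.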
This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. Additionally, we analyze the mutational signatures of the samples in each subtype, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study of "gene-motif" suggests the importance of considering both the context of the mutations and the underlying mutational processes when identifying cancer-associated genes. The source code for our proposed clustering pipeline and analysis is publicly available at https://github.com/bcb-sut/Pan-Cancer.
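The idea of pairing a gene with the sequence context of its mutations can be sketched as follows. The abstract does not define "gene-motif" precisely, so this is one plausible reading rather than the authors' definition: each single-base substitution is described by its trinucleotide context, and occurrences are counted per (gene, motif) pair. The record layout, the `motif` and `gene_motif_counts` names, and the toy data are all hypothetical.

```python
from collections import Counter


def motif(context, alt):
    """Format a substitution with its flanking bases.

    context is the reference trinucleotide centred on the mutated base,
    alt is the substituted base, e.g. ("TCG", "T") -> "T[C>T]G".
    """
    if len(context) != 3:
        raise ValueError("context must be a trinucleotide")
    five, ref, three = context.upper()
    return f"{five}[{ref}>{alt.upper()}]{three}"


def gene_motif_counts(mutations):
    """Count occurrences of each (gene, motif) pair.

    mutations is an iterable of (gene, context, alt) tuples.
    (Assumed record layout, for illustration only.)
    """
    return Counter((gene, motif(ctx, alt)) for gene, ctx, alt in mutations)


if __name__ == "__main__":
    # Toy records, invented for this example.
    records = [
        ("TP53", "TCG", "T"),
        ("TP53", "TCG", "T"),
        ("KRAS", "ACC", "A"),
    ]
    for (gene, m), n in gene_motif_counts(records).most_common():
        print(gene, m, n)
```

Counting at this granularity keeps the gene identity and the mutational context together, which is the point the abstract makes about considering both when looking for cancer-associated genes.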