Shuaibi Ahmed, Chitra Uthsav, Raphael Benjamin J
Department of Computer Science, Princeton University.
Lewis-Sigler Institute for Integrative Genomics, Princeton University.
bioRxiv. 2024 Apr 27:2024.04.24.590995. doi: 10.1101/2024.04.24.590995.
A key challenge in cancer genomics is understanding the functional relationships and dependencies between combinations of somatic mutations that drive cancer development. Such mutations frequently exhibit patterns of mutual exclusivity or co-occurrence across tumors, and many methods have been developed to identify such dependency patterns from bulk DNA sequencing data of a cohort of patients. However, while mutual exclusivity and co-occurrence are described as properties of driver mutations, existing methods do not explicitly disentangle functional driver mutations from neutral passenger mutations. In particular, nearly all existing methods evaluate mutual exclusivity or co-occurrence at the gene level, marking a gene as mutated if any mutation - driver or passenger - is present. Since some genes have a large number of passenger mutations, existing methods either restrict their analyses to a small subset of suspected driver genes - limiting their ability to identify novel dependencies - or make spurious inferences of mutual exclusivity and co-occurrence involving genes with many passenger mutations. We introduce DIALECT, an algorithm to identify dependencies between pairs of driver mutations from somatic mutation counts. We derive a latent variable mixture model for drivers and passengers that combines existing probabilistic models of passenger mutation rates with a latent variable describing the unknown status of a mutation as a driver or passenger. We use an expectation-maximization (EM) algorithm to estimate the parameters of our model, including the rates of mutual exclusivity and co-occurrence between drivers. We demonstrate that DIALECT infers mutual exclusivity and co-occurrence between driver mutations more accurately than existing methods on both simulated mutation data and somatic mutation data from 5 cancer types in The Cancer Genome Atlas (TCGA).
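To make the idea of a latent variable mixture model fitted by EM more concrete, the following is a minimal illustrative sketch for a single pair of genes. It is not the DIALECT implementation. It assumes, purely for illustration, that passenger counts in each gene are Poisson with a known background rate (`lam_a`, `lam_b`), that each sample carries at most one driver mutation per gene, and that the observed count is the passenger count plus the driver indicator. The function names and the parameter table `tau` (the probabilities of the four joint driver states) are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k events with rate lam > 0; zero for negative k."""
    if k < 0:
        return 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def em_pairwise(counts_a, counts_b, lam_a, lam_b, n_iter=200):
    """Estimate joint driver-state probabilities for two genes with EM.

    Assumed model (illustrative only): observed count = passenger count
    + driver indicator, with Poisson passenger counts of known rate.
    Returns a dict mapping (driver_in_a, driver_in_b) to its probability.
    """
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    tau = {s: 0.25 for s in states}  # uniform starting point
    n = len(counts_a)
    for _ in range(n_iter):
        totals = {s: 0.0 for s in states}
        for ca, cb in zip(counts_a, counts_b):
            # E-step: posterior weight of each joint driver state in this
            # sample, given the remaining counts are explained by passengers.
            weights = {
                (da, db): tau[(da, db)]
                * poisson_pmf(ca - da, lam_a)
                * poisson_pmf(cb - db, lam_b)
                for (da, db) in states
            }
            norm = sum(weights.values())
            for s in states:
                totals[s] += weights[s] / norm
        # M-step: each state probability is its average posterior weight.
        tau = {s: totals[s] / n for s in states}
    return tau
```

Under these assumptions, a large `tau[(1, 0)] + tau[(0, 1)]` relative to `tau[(1, 1)]` would point toward mutual exclusivity between drivers, and a large `tau[(1, 1)]` toward co-occurrence. Because the passenger rates enter the E-step directly, a gene with a high background rate has more of its observed mutations attributed to passengers rather than drivers.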