Department of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden.
Science for Life Laboratory, Stockholm, Sweden.
PLoS Comput Biol. 2022 Dec 5;18(12):e1010732. doi: 10.1371/journal.pcbi.1010732. eCollection 2022 Dec.
Identifying the interrelations among cancer driver genes and the patterns in which they are mutated is critical for understanding cancer. In this paper, we study cross-sectional data from cohorts of tumors to identify the cancer-type (or subtype) specific process by which cancer driver genes accumulate critical mutations. We model this mutation accumulation process using a tree in which each node contains a driver gene or a set of driver genes. A mutation in a node gives each of its children a chance of mutating. This model simultaneously explains the mutual exclusivity patterns observed among mutations in specific cancer genes (through its nodes) and the temporal order of events (through its edges). We introduce a computationally efficient dynamic programming procedure for calculating the likelihood of noisy datasets and use it to build our Markov Chain Monte Carlo (MCMC) inference algorithm, ToMExO. Together with a set of engineered MCMC moves, the fast likelihood calculations enable us to work with datasets containing hundreds of genes and thousands of tumors, which available cancer progression analysis methods cannot handle. We demonstrate our method's performance on several synthetic datasets covering various scenarios for cancer progression dynamics. A comparison against two state-of-the-art methods on a moderate-size biological dataset then shows the merits of our algorithm in identifying significant and valid patterns. Finally, we present our analyses of several large biological datasets, including colorectal cancer, glioblastoma, and pancreatic cancer. In all the analyses, we validate the results using a set of method-independent metrics that test the causality and significance of the relations identified by ToMExO or competing methods.
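The tree model described in the abstract can be illustrated with a small generative sketch. This is not the ToMExO implementation, and it does not reproduce the paper's likelihood computation or MCMC moves. It only simulates tumors under assumptions chosen for illustration: each node carries a hypothetical firing probability `prob`, a node that fires mutates exactly one of its genes (which yields mutual exclusivity within the node), a node can fire only if its parent fired (which yields the temporal order along edges), and observation noise is modeled with hypothetical false positive and false negative rates `fp` and `fn`. The gene names `A` to `E` are placeholders, not real driver genes.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an illustrative progression tree (hypothetical structure)."""
    genes: list                      # driver genes grouped in this node
    prob: float                      # assumed chance the node fires once its parent has fired
    children: list = field(default_factory=list)


def sample_tumor(top_nodes, rng):
    """Sample the set of truly mutated genes for one tumor.

    top_nodes are the children of an implicit root that is always 'on'.
    """
    mutated = set()
    stack = list(top_nodes)
    while stack:
        node = stack.pop()
        if rng.random() < node.prob:
            # Assumption: exactly one gene per fired node, giving mutual exclusivity.
            mutated.add(rng.choice(node.genes))
            # Children become eligible only after their parent has mutated.
            stack.extend(node.children)
    return mutated


def add_noise(mutated, all_genes, fp, fn, rng):
    """Flip observations with assumed false positive (fp) and false negative (fn) rates."""
    observed = set()
    for gene in all_genes:
        if gene in mutated:
            if rng.random() >= fn:   # keep a true mutation unless it is missed
                observed.add(gene)
        elif rng.random() < fp:      # spurious mutation call
            observed.add(gene)
    return observed


# Placeholder tree: node {A, B} must fire before {C} or {D, E} can.
tree = [Node(["A", "B"], 0.8, children=[Node(["C"], 0.5), Node(["D", "E"], 0.4)])]
all_genes = ["A", "B", "C", "D", "E"]

rng = random.Random(0)
clean = [sample_tumor(tree, rng) for _ in range(1000)]
noisy = [add_noise(t, all_genes, fp=0.01, fn=0.05, rng=rng) for t in clean]
```

In the noise-free samples, no tumor carries both `A` and `B`, and `C`, `D`, or `E` appear only together with `A` or `B`. The noisy samples can violate both patterns, which is why inference on real cohorts needs a likelihood that accounts for observation errors.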