Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus N, DK-8200, Denmark.
Bioinformatics Research Centre, Aarhus University, C.F. Mollers Alle 8, Aarhus C, DK-8000, Denmark.
BMC Bioinformatics. 2018 Apr 19;19(1):147. doi: 10.1186/s12859-018-2141-2.
Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration.
To account for fine-scale variation in the explanatory variables, we model the probabilities of different types of mutations at each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare how well region-based models and site-specific models predict the mutation rate. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than the region-based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors of the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures.
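As a minimal sketch of what a site-specific multinomial logistic model computes, the Python snippet below turns the explanatory variables of a single genomic position into probabilities for "no mutation" and for each mutation type. The function name, the mutation-type labels, the choice of three explanatory variables and all coefficient values are hypothetical and chosen only for illustration; they are not the fitted model or the parameterisation reported in the paper.

```python
import math


def mutation_probabilities(features, coefficients):
    """Multinomial logistic probabilities for one genomic position.

    features: explanatory-variable values for the site (hypothetical here).
    coefficients: dict mapping each mutation type to (intercept, weights).
    The reference category "none" (no mutation) has linear predictor 0.
    """
    scores = {"none": 0.0}
    for mutation_type, (intercept, weights) in coefficients.items():
        scores[mutation_type] = intercept + sum(
            w * x for w, x in zip(weights, features)
        )
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Hypothetical site: (phyloP score, replication timing, expression level).
site = [1.2, 0.4, 0.0]

# Made-up coefficients for three illustrative mutation types.
example_coefficients = {
    "type_1": (-12.0, [-0.10, 0.50, -0.05]),
    "type_2": (-13.0, [-0.08, 0.40, -0.02]),
    "type_3": (-13.5, [-0.12, 0.45, -0.03]),
}

site_probs = mutation_probabilities(site, example_coefficients)
site_mutation_rate = 1.0 - site_probs["none"]
```

Because each position gets its own linear predictor, variables measured at different scales (for example a per-base conservation score alongside a regional replication-timing value) can enter the same model without being averaged over a region.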
We find that our site-specific multinomial regression model outperforms the region-based models. Because it can include genomic variables on different scales as well as patient-specific variables, it is a versatile framework for studying different mutational mechanisms. Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.
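One way such a null model could be used to flag candidate regions is sketched below in Python: the per-site mutation probabilities in a region are summed to give an expected mutation count, and the observed count is compared with it. The Poisson approximation to the count distribution, the function names and the example numbers are assumptions made for this illustration; the abstract does not specify which test statistic is used.

```python
import math


def expected_mutations(site_mutation_probs):
    """Expected mutation count in a region: sum of per-site probabilities."""
    return sum(site_mutation_probs)


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Assumption: the regional count is approximated by a Poisson variable,
    which is reasonable when every per-site probability is small.
    """
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term                 # add P(X = k)
        term *= expected / (k + 1)  # advance to P(X = k + 1)
    return max(0.0, 1.0 - cdf)


# Hypothetical region of 500 sites, each with a made-up null probability.
region_probs = [2e-3] * 500
expected = expected_mutations(region_probs)

# Hypothetical observed count; a small tail probability marks the region
# as deviating from the neutral null model.
tail_probability = poisson_upper_tail(6, expected)
```

A region with a small tail probability has more mutations than the neutral model predicts and would be a candidate for follow-up, subject to correction for the number of regions tested.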