Department of Mathematics, Aarhus University, Aarhus, Denmark.
BMC Bioinformatics. 2023 May 8;24(1):187. doi: 10.1186/s12859-023-05304-1.
The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and choose the number of mutational signatures, which is the rank of the factorization. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by fitting several models that share the same underlying distribution but differ in rank, and comparing them using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate.
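To make the Poisson baseline concrete, the following is a minimal sketch of NMF fitted with the standard multiplicative updates for the generalized Kullback-Leibler divergence, which corresponds to maximizing a Poisson likelihood. It is an illustration only and is not taken from SigMoS: the function names, the orientation of the count matrix (patients in rows, mutation types in columns), and the simulated toy data are all assumptions.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-10):
    """Generalized KL divergence between counts V and reconstruction WH."""
    # Cells with V == 0 contribute only WH (using the convention 0 * log 0 = 0).
    log_term = np.where(V > 0, V * np.log((V + eps) / (WH + eps)), 0.0)
    return float(np.sum(log_term - V + WH))

def poisson_nmf(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (patients x mutation types) by W @ H under a Poisson model.

    W (patients x rank) holds exposures, H (rank x mutation types) holds
    signatures. This naming is an assumption made for the sketch.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, m))
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# Simulated toy data (hypothetical sizes: 20 patients, 12 mutation types, 3 signatures).
rng = np.random.default_rng(1)
W_true = rng.uniform(0, 50, size=(20, 3))
H_true = rng.dirichlet(np.ones(12), size=3)
V = rng.poisson(W_true @ H_true).astype(float)

W_hat, H_hat = poisson_nmf(V, rank=3)
fit = kl_divergence(V, W_hat @ H_hat)
```

Repeating the fit for several values of `rank` and comparing the resulting fits is the kind of comparison across ranks that the classical procedures mentioned above rely on.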
We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method and on other classical model selection procedures. We also present a simulation study comparing methods, in which we show that state-of-the-art methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice.
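As an illustration of what update rules for a Negative Binomial NMF can look like, the sketch below uses the majorization-minimization multiplicative updates known from the general Negative Binomial matrix factorization literature, with one dispersion value per patient. It is not the SigMoS implementation and may differ from the rules derived in the paper. In particular, the dispersion `alpha` is held fixed here rather than estimated, and the parametrization (mean `WH`, variance `WH + WH**2 / alpha`), the function names, and the simulated toy data are assumptions.

```python
import numpy as np

def nb_objective(V, WH, alpha):
    """Negative NB log-likelihood, dropping terms that do not depend on W or H."""
    alpha = np.reshape(alpha, (-1, 1))  # one dispersion value per patient (row)
    return float(np.sum((V + alpha) * np.log(alpha + WH) - V * np.log(WH)))

def nb_nmf(V, rank, alpha, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (patients x mutation types) by W @ H under an NB model.

    alpha has one entry per patient and is kept fixed in this sketch.
    As alpha grows, the updates approach the Poisson (KL) updates.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    alpha = np.asarray(alpha, dtype=float).reshape(n, 1)
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, m))
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (((V + alpha) / (WH + alpha)) @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ((V + alpha) / (WH + alpha)) + eps)
    return W, H

# Simulated overdispersed toy data (hypothetical sizes and dispersion range).
rng = np.random.default_rng(1)
alpha_true = rng.uniform(5, 50, size=(20, 1))
mu = rng.uniform(0, 50, size=(20, 3)) @ rng.dirichlet(np.ones(12), size=3)
# NumPy's (n, p) parametrization with n = alpha and p = alpha / (alpha + mu) has mean mu.
V_nb = rng.negative_binomial(alpha_true, alpha_true / (alpha_true + mu)).astype(float)

W_nb, H_nb = nb_nmf(V_nb, rank=3, alpha=alpha_true)
objective = nb_objective(V_nb, W_nb @ H_nb + 1e-10, alpha_true)
```

Small values of `alpha` for a patient mean strong overdispersion for that patient's counts, which is the variation across patients that a patient-specific dispersion parameter is meant to capture.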
With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature at finding the true number of signatures. Lastly, the residual analysis clearly reveals the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS, which can be found at https://github.com/MartaPelizzola/SigMoS.