Liu Mo, Wu Yang, Jiang Nanhai, Boot Arnoud, Rozen Steven G
Programme in Cancer & Stem Cell Biology, Duke-NUS Medical School, 169857 Singapore.
Centre for Computational Biology, Duke-NUS Medical School, 169857 Singapore.
NAR Genom Bioinform. 2023 Jan 23;5(1):lqad005. doi: 10.1093/nargab/lqad005. eCollection 2023 Mar.
Mutational signatures are characteristic patterns of mutations caused by endogenous or exogenous mutational processes. These signatures can be discovered by analyzing mutations in large sets of samples, usually somatic mutations in tumor samples. Most programs for discovering mutational signatures are based on non-negative matrix factorization (NMF). Alternatively, signatures can be discovered using hierarchical Dirichlet process (HDP) mixture models, an approach that has been less explored. These models assign mutations to clusters and view each cluster as being generated from the signature of a particular mutational process. Here, we describe mSigHdp, an improved approach to using HDP mixture models to discover mutational signatures. We benchmarked mSigHdp and state-of-the-art NMF-based approaches on four realistic synthetic data sets. These data sets encompassed 18 cancer types. In total, they contained 3.5 × 10^7 single-base-substitution mutations representing 32 signatures and 6.1 × 10^6 small insertion and deletion mutations representing 13 signatures. For three of the four data sets, mSigHdp had the best positive predictive value for discovering mutational signatures, and for all four data sets, it had the best true positive rate. Its CPU usage was similar to that of the NMF-based approaches. Thus, mSigHdp is an important and practical addition to the set of tools available for discovering mutational signatures.
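The benchmark above is summarized by positive predictive value (PPV) and true positive rate (TPR). As a minimal sketch of how such scores could be computed on synthetic data, the Python below matches each extracted signature to at most one ground-truth signature by cosine similarity and counts the matches. The function names, the greedy one-to-one matching, and the 0.9 similarity threshold are illustrative assumptions, not the procedure described by the authors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length mutation spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_discovery(extracted, ground_truth, threshold=0.9):
    """Return (PPV, TPR) for a set of extracted signatures.

    Assumption: an extracted signature is a true positive if it can be
    paired one-to-one with a ground-truth signature at cosine similarity
    >= threshold. Pairs are chosen greedily, most similar first.
    """
    pairs = sorted(
        (
            (cosine_similarity(e, g), i, j)
            for i, e in enumerate(extracted)
            for j, g in enumerate(ground_truth)
        ),
        reverse=True,
    )
    used_extracted, used_truth = set(), set()
    true_positives = 0
    for similarity, i, j in pairs:
        if similarity < threshold:
            break  # all remaining pairs are less similar
        if i in used_extracted or j in used_truth:
            continue  # each signature may be matched only once
        used_extracted.add(i)
        used_truth.add(j)
        true_positives += 1
    ppv = true_positives / len(extracted) if extracted else 0.0
    tpr = true_positives / len(ground_truth) if ground_truth else 0.0
    return ppv, tpr


# Toy 4-channel spectra (real SBS signatures typically have 96 channels).
ground_truth = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
extracted = [
    [0.98, 0.02, 0.0, 0.0],    # close to ground-truth signature 0
    [0.0, 0.97, 0.03, 0.0],    # close to ground-truth signature 1
    [0.25, 0.25, 0.25, 0.25],  # flat spectrum, matches nothing
]
ppv, tpr = score_discovery(extracted, ground_truth)
print(f"PPV = {ppv:.3f}, TPR = {tpr:.3f}")  # PPV = 0.667, TPR = 0.500
```

In this toy case two of the three extracted signatures match a ground-truth signature, so PPV is 2/3, and two of the four ground-truth signatures are recovered, so TPR is 1/2.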