Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD, USA.
Bioinformatics. 2019 Jul 15;35(14):i492-i500. doi: 10.1093/bioinformatics/btz340.
Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic features (e.g. smoking history) or molecular features (e.g. inactivation of DNA damage repair genes).
To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor's observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort, using cancer type as a covariate, and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.
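The idea of a covariate-dependent prior on signature exposures can be illustrated with a small generative sketch. It is not the TCSM implementation: it assumes a logistic-normal prior whose mean is a linear function of the covariates (a form borrowed from covariate-aware topic models), and the names `simulate_exposures`, `simulate_counts`, `gamma`, and `sigma` are hypothetical, chosen only for this example.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def simulate_exposures(covariates, gamma, sigma=0.5, seed=0):
    """Draw per-tumor signature exposures from a covariate-dependent prior.

    ASSUMPTION: logistic-normal prior with mean = covariates @ gamma.
    covariates: (n_tumors, n_covariates) design matrix (include an intercept column).
    gamma:      (n_covariates, n_signatures) hypothetical covariate effects.
    """
    rng = np.random.default_rng(seed)
    mean = covariates @ gamma
    eta = mean + sigma * rng.standard_normal(mean.shape)
    return softmax(eta)  # each row is a point on the simplex


def simulate_counts(exposures, signatures, n_mutations, seed=0):
    """Sample mutation-category counts for each tumor.

    signatures:  (n_signatures, n_categories), each row a probability vector.
    n_mutations: (n_tumors,) number of mutations to draw per tumor.
    """
    rng = np.random.default_rng(seed)
    probs = exposures @ signatures
    probs = probs / probs.sum(axis=1, keepdims=True)  # guard against rounding drift
    return np.vstack([rng.multinomial(n, p) for n, p in zip(n_mutations, probs)])
```

In this sketch, a large positive entry in `gamma` for one covariate shifts the prior mass toward the corresponding signature in tumors carrying that covariate, which is the kind of signature-covariate association the abstract describes estimating and testing.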
TCSM is implemented in Python 3 and available at https://github.com/lrgr/tcsm, along with a data workflow for reproducing the experiments in the paper.
Supplementary data are available at Bioinformatics online.