The Geisel School of Medicine, Department of Biomedical Data Science, Dartmouth College, HB7936, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, 03756, USA.
Department of Statistics, Rice University, M.S. 138, 6100 Main Street, Houston, TX, 77005, USA.
BMC Bioinformatics. 2018 Nov 19;19(1):430. doi: 10.1186/s12859-018-2455-0.
Because driver mutations provide a selective advantage to the mutant clone, they tend to occur at a higher frequency in tumor samples than selectively neutral (passenger) mutations. However, mutation frequency alone is insufficient to identify cancer genes because mutability is influenced by many gene characteristics, such as gene size and nucleotide composition. The goal of this study was to identify gene characteristics associated with the frequency of somatic mutations in a gene in tumor samples.
We used data on somatic mutations detected by genome-wide screens from the Catalog of Somatic Mutations in Cancer (COSMIC). Eleven gene characteristics, including gene size, nucleotide composition, expression level, relative replication time in the cell cycle, and level of evolutionary conservation, were used as predictors of the number of somatic mutations. We applied stepwise multiple linear regression to predict the number of mutations per gene. Because missense, nonsense, and frameshift mutations are associated with different sets of gene characteristics, they were modeled separately. Gene characteristics explain 88% of the variation in the number of missense mutations, 40% for nonsense mutations, and 23% for frameshift mutations. Comparing the observed and expected numbers of mutations identified genes with a higher than expected number of mutations (positive outliers). Many of these are known driver genes. A number of novel candidate driver genes were also identified.
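The regression-and-outlier approach described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the forward-selection stopping rule (`min_gain`), the residual z-score cutoff (`z_cut`), and all function names are assumptions made for illustration, and the four random columns stand in for the eleven gene characteristics.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, 1.0 - (resid @ resid) / ss_tot

def forward_stepwise(X, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor that raises
    R^2 the most; stop when the gain drops below min_gain (assumed rule)."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2, j = max((fit_ols(X[:, selected + [k]], y)[1], k) for k in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

def positive_outliers(X, y, selected, z_cut=3.0):
    """Indices of genes whose observed count exceeds the prediction by more
    than z_cut residual standard deviations (assumed criterion)."""
    beta, _ = fit_ols(X[:, selected], y)
    A = np.column_stack([np.ones(len(y)), X[:, selected]])
    resid = y - A @ beta
    z = (resid - resid.mean()) / resid.std()
    return np.flatnonzero(z > z_cut)

# Synthetic stand-in data: 500 "genes", 4 hypothetical characteristics,
# of which only columns 0 and 2 actually influence the mutation count.
rng = np.random.default_rng(0)
n_genes = 500
X = rng.normal(size=(n_genes, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n_genes)
y[7] += 10.0  # simulate one gene with an excess of mutations

selected, r2 = forward_stepwise(X, y)
outliers = positive_outliers(X, y, selected)
print("selected predictors:", selected)
print("R^2:", round(float(r2), 3))
print("positive outliers:", outliers.tolist())
```

In this sketch a separate model would be fit for each mutation type (missense, nonsense, frameshift), mirroring the separate modeling described above; genes flagged as positive outliers are the candidates for follow-up.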
By comparing the observed and predicted numbers of mutations in a gene, we identified known cancer-associated genes as well as 111 novel cancer-associated genes. We also showed that adding the number of silent mutations per gene, as reported by genome/exome-wide screens across all cancer types (COSMIC data), as a predictor yields prediction accuracy that substantially exceeds that of the most popular cancer gene prediction tool, MutSigCV.