Pancotti Corrado, Rollo Cesare, Birolo Giovanni, Benevenuta Silvia, Fariselli Piero, Sanavia Tiziana
Department of Medical Sciences, University of Torino, Torino, Italy.
Front Genet. 2023 Jan 4;13:1049501. doi: 10.3389/fgene.2022.1049501. eCollection 2022.
The high cosine similarity between some single-base substitution mutational signatures and their characteristic flat profiles could suggest the presence of overfitting and mathematical artefacts. The newest version (v3.3) of the signature database available in the Catalogue Of Somatic Mutations In Cancer (COSMIC) provides a collection of 79 mutational signatures, more than double the number in the previous version (30 profiles in COSMIC signatures v2), which makes associating signatures with specific mutagenic processes more critical. This study provides a systematic assessment of the extraction task through simulation scenarios based on the latest version of the COSMIC signatures. It also highlights, through a novel approach using archetypal analysis, which COSMIC signatures are redundant and more likely to be mathematical artefacts. Twenty-nine archetypes were able to reconstruct the profiles of all the COSMIC signatures with cosine similarity ≥ 0.8. Interestingly, these archetypes tend to group similar original signatures that share either the same aetiology or similar biological processes. We believe that these findings will encourage the development of new extraction methods that avoid redundant information among the signatures while preserving their biological interpretation.
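The abstract relies on cosine similarity both to flag near-flat signatures and to judge reconstruction quality against the 0.8 threshold. The sketch below only illustrates that metric and is not the authors' pipeline. It assumes profiles are vectors over the 96 trinucleotide mutation channels conventionally used for single-base substitution signatures (background not stated in the abstract), and the `near_flat` and `peaked` profiles are hypothetical examples, not COSMIC data.

```python
import math

N_CHANNELS = 96  # assumption: standard 96-channel SBS representation


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("profiles must be non-zero")
    return dot / (norm_a * norm_b)


# Perfectly flat profile: equal probability in every channel.
flat = [1.0 / N_CHANNELS] * N_CHANNELS

# Hypothetical near-flat profile: +/-20% alternating perturbation, still sums to 1.
near_flat = [(1.0 + 0.2 * (-1) ** i) / N_CHANNELS for i in range(N_CHANNELS)]

# Hypothetical peaked profile: 90% of the mass in a single channel.
peaked = [0.1 / (N_CHANNELS - 1)] * N_CHANNELS
peaked[0] = 0.9

THRESHOLD = 0.8  # reconstruction threshold quoted in the abstract

for name, profile in [("near_flat", near_flat), ("peaked", peaked)]:
    sim = cosine_similarity(flat, profile)
    print(f"{name}: cosine similarity to flat = {sim:.3f}, "
          f"meets threshold = {sim >= THRESHOLD}")
```

Because every entry of a signature is non-negative, even visibly different profiles can score well above zero, which is one reason a flat-looking profile can resemble many signatures at once.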