Institute of Molecular and Clinical Ophthalmology Basel, 4031 Basel, Switzerland; Department of Ophthalmology, University of Basel, 4031 Basel, Switzerland; Department of Genetics and Genome Biology, University of Leicester, Leicester LE1 7RH, UK.
Institute of Molecular and Clinical Ophthalmology Basel, 4031 Basel, Switzerland; Department of Ophthalmology, University of Basel, 4031 Basel, Switzerland; Department of Genetics and Genome Biology, University of Leicester, Leicester LE1 7RH, UK; Institute of Experimental Pathology, Lausanne University Hospital (CHUV), 1011 Lausanne, Switzerland.
Am J Hum Genet. 2022 Mar 3;109(3):457-470. doi: 10.1016/j.ajhg.2022.01.006. Epub 2022 Feb 3.
We used a machine learning approach to analyze the within-gene distribution of missense variants observed in hereditary conditions and cancer. When applied to 840 genes from the ClinVar database, this approach detected a significant non-random distribution of pathogenic variants in 387 genes (46%) and of benign variants in 172 genes (20%), revealing that variant clustering is widespread across the human exome. This clustering likely occurs as a consequence of mechanisms shaping pathogenicity at the protein level, as illustrated by the overlap of some clusters with known functional domains. We then took advantage of these findings to develop a pathogenicity predictor, MutScore, that integrates qualitative features of DNA substitutions with the additional information derived from this positional clustering. Using a random forest approach, MutScore was able to identify pathogenic missense mutations with very high accuracy, outperforming existing predictive tools, especially for variants associated with autosomal-dominant disease and cancer. Thus, the within-gene clustering of pathogenic and benign DNA changes is an important and previously underappreciated feature of the human exome, which can be harnessed to improve the prediction of pathogenicity and the disambiguation of DNA variants of uncertain significance.
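The abstract does not specify how non-random positional distribution was tested, so the following is only a minimal sketch of one generic way to ask the question, not the authors' method. It assumes a simple permutation test: the mean nearest-neighbor distance between variant positions in one protein is compared with the same statistic for positions drawn uniformly at random along the protein. The function names, the choice of statistic, and the toy positions are all hypothetical.

```python
import random


def mean_nearest_neighbor_distance(positions):
    """Average distance from each position to its closest neighbor.

    Assumes distinct residue positions (an illustrative simplification).
    """
    pts = sorted(positions)
    if len(pts) < 2:
        raise ValueError("need at least two positions")
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    nearest = []
    for i in range(len(pts)):
        left = gaps[i - 1] if i > 0 else float("inf")
        right = gaps[i] if i < len(gaps) else float("inf")
        nearest.append(min(left, right))
    return sum(nearest) / len(nearest)


def clustering_p_value(positions, protein_length, n_permutations=2000, seed=0):
    """One-sided permutation p-value for positional clustering.

    Null model (an assumption): the same number of variants placed
    uniformly at random, without replacement, on residues 1..protein_length.
    A small p-value means the observed positions are closer together
    than expected under that null.
    """
    rng = random.Random(seed)
    observed = mean_nearest_neighbor_distance(positions)
    k = len(positions)
    hits = 0
    for _ in range(n_permutations):
        simulated = rng.sample(range(1, protein_length + 1), k)
        if mean_nearest_neighbor_distance(simulated) <= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data, not taken from ClinVar.
    clustered = [100, 101, 102, 103, 104, 105, 106, 107]
    spread = [50, 175, 300, 425, 550, 675, 800, 925]
    print("clustered:", clustering_p_value(clustered, 1000))
    print("spread:", clustering_p_value(spread, 1000))
```

A uniform null ignores real-world effects such as uneven mutability or ascertainment along a gene, so a test like this would at best be a first screen. A per-residue clustering signal of this kind is the sort of positional feature that could be supplied to a random forest classifier alongside substitution-level features.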