Department of Medicine, The University of Melbourne, Austin Health and Royal Melbourne Hospital, Melbourne, Victoria 3010, Australia.
Simcere Diagnostics, Nanjing, 210042, China.
Genome Res. 2017 Oct;27(10):1715-1729. doi: 10.1101/gr.226589.117. Epub 2017 Sep 1.
Gene panel and exome sequencing have revealed a high rate of molecular diagnoses among diseases where the genetic architecture has proven suitable for sequencing approaches, with a large number of distinct and highly penetrant causal variants identified among a growing list of disease genes. The challenge is, given the DNA sequence of a new patient, to distinguish disease-causing from benign variants. Large samples of human standing variation data highlight regional variation in the tolerance to missense variation within the protein-coding sequence of genes. This information is not well captured by existing bioinformatic tools, but is effective in improving variant interpretation. To address this limitation in existing tools, we introduce the missense tolerance ratio (MTR), which summarizes available human standing variation data within genes to encapsulate population level genetic variation. We find that patient-ascertained pathogenic variants preferentially cluster in low MTR regions (P < 0.005) of well-informed genes. By evaluating 20 publicly available predictive tools across genes linked to epilepsy, we also highlight the importance of understanding the empirical null distribution of existing prediction tools, as these vary across genes. Subsequently integrating the MTR with the empirically selected bioinformatic tools in a gene-specific approach demonstrates a clear improvement in the ability to predict pathogenic missense variants from background missense variation in disease genes. Among an independent test sample of case and control missense variants, case variants (0.83 median score) consistently achieve higher pathogenicity prediction probabilities than control variants (0.02 median score; Mann-Whitney test, P < 1 × 10⁻¹⁵). We focus on the application to epilepsy genes; however, the framework is applicable to disease genes beyond epilepsy.
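The abstract does not spell out how the MTR is calculated. As a rough illustration only, the sketch below assumes the ratio is the observed missense proportion in a window of codons (missense over missense plus synonymous variants seen in the population) divided by the proportion expected from all possible single-nucleotide changes in that window. The `WindowCounts` container, the function name, and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowCounts:
    """Variant counts for one window of codons (hypothetical container)."""
    observed_missense: int    # missense variants seen in the population data
    observed_synonymous: int  # synonymous variants seen in the population data
    possible_missense: int    # all possible missense changes in the window
    possible_synonymous: int  # all possible synonymous changes in the window


def missense_tolerance_ratio(w: WindowCounts) -> Optional[float]:
    """Assumed form: observed missense fraction / expected missense fraction.

    Returns None when the ratio is undefined, i.e. the window has no observed
    variants or no possible missense changes.
    """
    observed_total = w.observed_missense + w.observed_synonymous
    possible_total = w.possible_missense + w.possible_synonymous
    if observed_total == 0 or w.possible_missense == 0:
        return None
    observed_fraction = w.observed_missense / observed_total
    expected_fraction = w.possible_missense / possible_total
    return observed_fraction / expected_fraction


# Illustrative counts only: fewer missense variants than expected gives MTR < 1.
intolerant = WindowCounts(2, 8, 70, 30)
neutral = WindowCounts(7, 3, 70, 30)
print(missense_tolerance_ratio(intolerant))
print(missense_tolerance_ratio(neutral))
```

Under this assumed definition, a value near 1 means missense variation is observed about as often as expected, while values well below 1 mark a window that appears intolerant to missense change.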