Steinberg Julia, Honti Frantisek, Meader Stephen, Webber Caleb
MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3PT, UK; The Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK.
MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3PT, UK.
Nucleic Acids Res. 2015 Sep 3;43(15):e101. doi: 10.1093/nar/gkv474. Epub 2015 May 22.
Any given human individual carries multiple genetic variants that disrupt protein-coding genes, through structural variation as well as nucleotide variants and indels. Predicting the phenotypic consequences of a gene disruption remains a significant challenge. Current approaches employ information from a range of biological networks to predict which human genes are haploinsufficient (meaning two copies are required for normal function) or essential (meaning at least one copy is required for viability). Using recently available study gene sets, we show that these approaches are strongly biased towards providing accurate predictions for well-studied genes. By contrast, we derive a haploinsufficiency score from a combination of unbiased large-scale high-throughput datasets, including gene co-expression and genetic variation in over 6000 human exomes. Among genes currently unassociated with papers listed in PubMed, our approach provides a haploinsufficiency prediction for over twice as many genes as three commonly used approaches, and it outperforms these approaches at predicting haploinsufficiency for less-studied genes. We also show that fine-tuning the predictor on a set of well-studied 'gold standard' haploinsufficient genes does not improve the prediction for less-studied genes. This new score can readily be used to prioritize gene disruptions resulting from any genetic variant, including copy number variants, indels and single-nucleotide variants.
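As a rough illustration of the prioritization use mentioned in the abstract, the sketch below ranks the genes disrupted in one individual by a per-gene haploinsufficiency score. It is not the authors' implementation. The function name, the gene identifiers, the score values, and the convention that a higher score means "more likely haploinsufficient" are all assumptions made for the example; the abstract does not specify the score's scale or file format.

```python
from typing import Dict, Iterable, List, Tuple


def prioritize_disrupted_genes(
    disrupted: Iterable[str],
    scores: Dict[str, float],
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Rank disrupted genes by a haploinsufficiency score (hypothetical helper).

    Assumes a higher score means the gene is more likely haploinsufficient.
    Returns (ranked, unscored): ranked is a list of (gene, score) pairs sorted
    by descending score, with ties broken alphabetically; unscored lists the
    genes that have no score, so they are reported rather than silently dropped.
    """
    unique = set(disrupted)  # a gene hit by several variants is counted once
    ranked = [(gene, scores[gene]) for gene in unique if gene in scores]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    unscored = sorted(gene for gene in unique if gene not in scores)
    return ranked, unscored


# Placeholder gene names and made-up scores, for illustration only.
example_scores = {"GENE_A": 0.91, "GENE_B": 0.12, "GENE_C": 0.57}
example_hits = ["GENE_B", "GENE_A", "GENE_D", "GENE_C", "GENE_A"]

ranked, unscored = prioritize_disrupted_genes(example_hits, example_scores)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
print("no score available:", ", ".join(unscored))
```

The same ranking applies regardless of variant type (copy number variant, indel, or single-nucleotide variant), since only the identity of the disrupted gene is used. Keeping unscored genes in a separate list makes gaps in coverage visible, which matters for less-studied genes.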