Han Alexander L, Sands Chloe F, Matelska Dorota, Butts Jessica C, Ravanmehr Vida, Hu Fengyuan, Villavicencio Gonzalez Esmeralda, Katsanis Nicholas, Bustamante Carlos D, Wang Quanli, Petrovski Slavé, Vitsios Dimitrios, Dhindsa Ryan S
Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA.
Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, TX, USA.
Nat Commun. 2025 Mar 18;16(1):2648. doi: 10.1038/s41467-025-57885-5.
The unprecedented scale of genomic databases has revolutionized our ability to identify regions in the human genome intolerant to variation, regions that are often implicated in disease. However, these datasets remain constrained by limited ancestral diversity. Here, we analyze whole-exome sequencing data from 460,551 UK Biobank and 125,748 Genome Aggregation Database (gnomAD) participants across multiple ancestries to test several key intolerance metrics, including the Residual Variance Intolerance Score (RVIS), Missense Tolerance Ratio (MTR), and Loss-of-Function Observed/Expected ratio (LOF O/E). We demonstrate that increasing ancestral representation, rather than sample size alone, critically drives their performance. Scores trained on variation observed in African and Admixed American ancestral groups show higher resolution in detecting haploinsufficient and neurodevelopmental disease risk genes compared to scores trained on European ancestry groups. Most strikingly, MTR trained on 43,000 multi-ancestry exomes demonstrates greater predictive power than when trained on a nearly 10-fold larger dataset of 440,000 non-Finnish European exomes. We further find that European ancestry group-based scores are likely approaching saturation. These findings highlight the need for enhanced population representation in genomic resources to fully realize the potential of precision medicine and drug discovery. Ancestry group-specific scores are publicly available through an interactive portal: http://intolerance.public.cgr.astrazeneca.com/