Dhindsa Ryan S, Weido Blake A, Dhindsa Justin S, Shetty Arya J, Sands Chloe F, Petrovski Slavé, Vitsios Dimitrios, Zoghbi Anthony W
Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA; Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, TX, USA.
Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA.
Am J Hum Genet. 2025 Mar 6;112(3):693-708. doi: 10.1016/j.ajhg.2025.02.001. Epub 2025 Feb 26.
Despite great progress, thousands of neurodevelopmental disorder (NDD) risk genes remain to be discovered. We present a computational approach that accelerates NDD risk gene identification using machine learning. First, we demonstrate that models trained solely on single-cell RNA sequencing data can robustly predict genes implicated in autism spectrum disorder (ASD), developmental and epileptic encephalopathy (DEE), and developmental delay (DD). Notably, in the developing human cortex, genes with monoallelic inheritance patterns show different expression patterns from genes with bi-allelic inheritance patterns. We then integrate expression data with 300 orthogonal features, including intolerance metrics, protein-protein interaction data, and others, in a semi-supervised machine learning framework (mantis-ml) to train inheritance-specific models for these disorders. The models have high predictive power (area under the receiver operating characteristic curve [AUC]: 0.84-0.95), and the top-ranked genes were up to 2-fold (monoallelic models) and 6-fold (bi-allelic models) more enriched for high-confidence NDD risk genes than genes ranked by genic intolerance metrics alone. Additionally, genes ranking in the top decile were 45 to 180 times more likely to have literature support than those in the bottom decile. Collectively, this work provides robust NDD risk gene predictions that can complement large-scale gene discovery efforts and underscores the importance of considering inheritance in gene risk prediction.
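To make the enrichment comparison above concrete, the following is a minimal sketch of one way to compare two gene rankings by how many known risk genes each places among its top-ranked genes. The abstract does not state how enrichment was computed, so the definition used here (a ratio of top-N hit counts) is an assumption. The function name `top_n_hits`, the gene labels, and both toy rankings are hypothetical and are not data or code from the study or from mantis-ml.

```python
from typing import Sequence, Set


def top_n_hits(ranked_genes: Sequence[str], known_genes: Set[str], top_n: int) -> int:
    """Count how many of the first top_n ranked genes are known risk genes."""
    if not 0 < top_n <= len(ranked_genes):
        raise ValueError("top_n must be between 1 and the number of ranked genes")
    return sum(gene in known_genes for gene in ranked_genes[:top_n])


# Toy data only: hypothetical gene labels, best-ranked gene first.
known_risk_genes = {"G1", "G2", "G3", "G4"}
model_ranking = ["G1", "G2", "G7", "G3", "G4", "G5", "G6", "G8", "G9", "G10"]
baseline_ranking = ["G1", "G5", "G6", "G2", "G7", "G3", "G8", "G4", "G9", "G10"]

top_n = 5
model_hits = top_n_hits(model_ranking, known_risk_genes, top_n)
baseline_hits = top_n_hits(baseline_ranking, known_risk_genes, top_n)

# Assumed definition: ratio of known-gene hits in the top N of each ranking.
# A baseline with zero hits would make this ratio undefined.
relative_enrichment = model_hits / baseline_hits
print(f"model hits={model_hits}, baseline hits={baseline_hits}, "
      f"relative enrichment={relative_enrichment:.1f}")
```

In this toy setup, the baseline ranking stands in for ordering genes by an intolerance metric alone. The same counting function could be applied at any cutoff, such as the top decile of a ranking.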