Varma Maya, Paskov Kelley Marie, Jung Jae-Yoon, Sierra Chrisman Brianna, Stockham Nate Tyler, Washington Peter Yigitcan, Wall Dennis Paul
Departments of Computer Science, Stanford University, Stanford, CA 94305, USA.
Pac Symp Biocomput. 2019;24:260-271.
Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an ℓ1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.
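The classification step described above (an ℓ1-regularized logistic regression scored by AUC-ROC on a held-out test set) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes scikit-learn, runs on synthetic allele counts rather than sequencing data, and every name and setting in it (the matrix sizes, the simulated effect sizes, `C=0.1`, the 75/25 split) is a hypothetical choice made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a variant matrix (assumption, not the study's data):
# rows are individuals, columns are variant sites, entries are
# alternate-allele counts (0, 1, or 2).
n_samples, n_variants, n_informative = 600, 200, 10
X = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)

# Hypothetical ground truth: only the first few variants influence case status.
true_w = np.zeros(n_variants)
true_w[:n_informative] = 1.0
logits = (X - X.mean(axis=0)) @ true_w
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Hold out a test set that the classifier never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The l1 penalty drives most coefficients to exactly zero, so the fitted
# model doubles as a variant selector. C is the inverse penalty strength.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)

# AUC-ROC is computed from predicted case probabilities on the held-out set.
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
n_selected = int(np.count_nonzero(clf.coef_))

print(f"held-out AUC-ROC: {test_auc:.3f}")
print(f"variants with nonzero weight: {n_selected} of {n_variants}")
```

In a setting like the one the abstract describes, one such model would be fit per variant category, and the penalty strength would normally be tuned by cross-validation on the training split rather than fixed in advance.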