Department of Computational & Systems Biology and Center for Evolutionary Biology and Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA; Department of Human Genetics, University of Pittsburgh School of Public Health, Pittsburgh, PA, USA.
Children's Hospital of Philadelphia, Philadelphia, PA, USA.
HGG Adv. 2024 Jul 18;5(3):100310. doi: 10.1016/j.xhgg.2024.100310. Epub 2024 May 21.
Non-protein-coding genetic variants are a major driver of the genetic risk for human disease; however, identifying which non-coding variants contribute to diseases, and through which mechanisms, remains challenging. In silico variant prioritization methods quantify a variant's severity, but for most methods, the specific phenotype and disease context of the prediction remain poorly defined. For example, many commonly used methods provide a single, organism-wide score for each variant, while other methods summarize a variant's impact in certain tissues and/or cell types. Here, we propose a complementary disease-specific variant prioritization scheme, which is motivated by the observation that variants contributing to disease often operate through specific biological mechanisms. We combine tissue/cell-type-specific variant scores (e.g., GenoSkyline, FitCons2, DNA accessibility) into disease-specific scores with a logistic regression approach and apply it to ∼25,000 non-coding variants spanning 111 diseases. We show that this disease-specific aggregation significantly improves the association of common non-coding genetic variants with disease (average precision: 0.151, baseline = 0.09), compared with organism-wide scores (GenoCanyon, LINSIGHT, GWAVA, Eigen, CADD; average precision: 0.129, baseline = 0.09). Furthermore, disease similarities based on the data-driven aggregation weights highlight meaningful disease groups and provide information about the tissues and cell types that drive these similarities. We also show that the learned similarities are complementary to genetic similarities as quantified by genetic correlation. Overall, our approach demonstrates the strengths of disease-specific variant prioritization, improves non-coding variant prioritization, and enables interpretable models that link variants to disease via specific tissues and/or cell types.
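The aggregation step described above, a logistic regression that combines per-tissue variant scores into one disease-specific score and is then evaluated by average precision, can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the three tissue features are hypothetical, and a small pure-Python gradient-descent fit stands in for whatever logistic regression software was actually used. In the synthetic data, only the first tissue score is assumed to carry disease signal.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=1.0, epochs=1000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def disease_score(x, w, b):
    """Disease-specific score: weighted combination of tissue scores."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def average_precision(y_true, scores):
    """Mean of precision-at-rank over all positives (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def make_synthetic(n=400, seed=0):
    """Hypothetical variants with 3 tissue scores in [0, 1]; only tissue 0 matters."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]
        p = sigmoid(6.0 * x[0] - 5.5)  # assumed generative model, not from the paper
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


def run_demo():
    X, y = make_synthetic()
    X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]
    w, b = fit_logistic(X_train, y_train)
    scores = [disease_score(x, w, b) for x in X_test]
    ap = average_precision(y_test, scores)
    prevalence = sum(y_test) / len(y_test)
    return w, ap, prevalence


if __name__ == "__main__":
    w, ap, prevalence = run_demo()
    print("learned tissue weights:", [round(v, 2) for v in w])
    print("held-out average precision:", round(ap, 3))
    print("positive fraction (AP expected from a random ranking):", round(prevalence, 3))
```

In this toy setting the learned weights play the role of the data-driven aggregation weights: a large weight on one tissue feature indicates that the tissue drives the disease-specific score, which is the kind of interpretability the abstract refers to.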