Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany.
Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, United States of America.
PLoS One. 2024 Apr 16;19(4):e0298906. doi: 10.1371/journal.pone.0298906. eCollection 2024.
Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.
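The bootstrap step described above can be illustrated with a deliberately simplified sketch. It is not the epiTree implementation and does not compute a PCS epistasis p-value: it omits iRF and the stabilized likelihood ratio test, and every name, effect size, and sample size below is a hypothetical choice made for illustration. It only shows the general idea of resampling a hold-out test set to ask how often a Boolean interaction rule fails to predict better than a no-interaction rule.

```python
import random


def simulate(n, rng):
    """Toy cohort (assumed, not UK Biobank data): two binary variants,
    with the phenotype driven by a Boolean AND of the two."""
    rows = []
    for _ in range(n):
        a = rng.random() < 0.5
        b = rng.random() < 0.5
        p_case = 0.9 if (a and b) else 0.1  # assumed effect sizes
        rows.append((a, b, rng.random() < p_case))
    return rows


def accuracy(rows, rule):
    """Fraction of individuals whose phenotype the rule predicts correctly."""
    return sum(rule(a, b) == y for a, b, y in rows) / len(rows)


def interaction_rule(a, b):
    """Hypothetical Boolean interaction: predict a case only if both variants are present."""
    return a and b


def single_variant_rule(a, b):
    """Hypothetical no-interaction comparator: predict from the first variant alone."""
    return a


def bootstrap_no_gain_fraction(test_rows, n_boot, rng):
    """Fraction of bootstrap resamples of the test set in which the
    interaction rule is NOT more accurate than the comparator."""
    n = len(test_rows)
    no_gain = 0
    for _ in range(n_boot):
        sample = [test_rows[rng.randrange(n)] for _ in range(n)]
        if accuracy(sample, interaction_rule) <= accuracy(sample, single_variant_rule):
            no_gain += 1
    return no_gain / n_boot


rng = random.Random(0)          # fixed seed so the sketch is reproducible
test_rows = simulate(500, rng)  # stands in for a hold-out test set
acc_interaction = accuracy(test_rows, interaction_rule)
acc_single = accuracy(test_rows, single_variant_rule)
no_gain_fraction = bootstrap_no_gain_fraction(test_rows, 200, rng)
print(acc_interaction, acc_single, no_gain_fraction)
```

A small fraction here means the interaction rule's gain in accuracy is stable across resamples of the test set. In this toy setting the gain is large by construction, so the sketch says nothing about how such a procedure behaves on real genomic data.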