Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium.
Genome Biol. 2023 Oct 5;24(1):224. doi: 10.1186/s13059-023-03064-y.
Despite clear evidence of nonlinear interactions in the molecular architecture of polygenic diseases, linear models have so far appeared optimal in genotype-to-phenotype modeling. A key bottleneck for such modeling is that genetic data intrinsically suffers from underdetermination ([Formula: see text]). Millions of variants are present in each individual while the collection of large, homogeneous cohorts is hindered by phenotype incidence, sequencing cost, and batch effects.
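As a rough illustration of what underdetermination means in practice, the following sketch fits a linear model to labels that are pure noise. The cohort size, variant count, and random genotype matrix are hypothetical and are not taken from the study; the only point is that when variants outnumber individuals, a linear model can reproduce any training labels exactly, so a good training fit says nothing about predictive value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variants = 50, 500  # hypothetical sizes: far more variants than individuals

# Random genotype-like matrix (0/1/2 allele counts) and labels that are pure noise.
X = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
y = rng.integers(0, 2, size=n_samples).astype(float)

# Minimum-norm least-squares solution; with more columns than rows and
# full row rank, it reproduces the training labels exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.max(np.abs(X @ beta - y))

# New individuals with fresh noise labels: the perfect training fit does not carry over.
X_new = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
y_new = rng.integers(0, 2, size=n_samples).astype(float)
test_mse = np.mean((X_new @ beta - y_new) ** 2)

print(f"max training residual: {train_error:.2e}")
print(f"held-out mean squared error: {test_mse:.3f}")
```

The same effect grows stronger for nonlinear models, which have more free parameters per variant, which is why the sample size and model complexity have to be considered together.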
We demonstrate that when we provide enough training data and control the complexity of nonlinear models, a neural network outperforms additive approaches in whole exome sequencing-based inflammatory bowel disease case-control prediction. To do so, we propose a biologically meaningful sparsified neural network architecture, providing empirical evidence for positive and negative epistatic effects present in the inflammatory bowel disease pathogenesis.
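The abstract does not specify how the network is sparsified. One common way to impose biological structure, shown here purely as an assumption, is to mask a dense layer so that each variant connects only to the gene it lies in. The variant-to-gene mapping, the layer sizes, and the helper `masked_dense` are all hypothetical.

```python
import numpy as np

def masked_dense(x, weights, mask):
    """Dense layer whose weights are zeroed wherever mask is 0, followed by ReLU."""
    return np.maximum(x @ (weights * mask), 0.0)

# Hypothetical annotation: 6 variants mapped to 3 genes.
variant_to_gene = np.array([0, 0, 1, 1, 1, 2])
n_variants, n_genes = variant_to_gene.size, 3

# mask[i, g] = 1 only if variant i lies in gene g.
mask = np.zeros((n_variants, n_genes))
mask[np.arange(n_variants), variant_to_gene] = 1.0

rng = np.random.default_rng(0)
weights = rng.normal(size=(n_variants, n_genes))

# One individual's allele counts per variant.
genotypes = np.array([[0, 1, 2, 0, 1, 0]], dtype=float)
gene_activations = masked_dense(genotypes, weights, mask)

# Number of weights that can be non-zero: 6 here, versus 18 for a full dense layer.
n_free = int(mask.sum())
```

Under this assumption the number of trainable weights scales with the number of variants rather than with variants times genes, which is one way to control model complexity when samples are scarce.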
In this paper, we show that underdetermination is likely a major driver for the apparent optimality of additive modeling in clinical genetics today.