Raimondi Daniele, Simm Jaak, Arany Adam, Fariselli Piero, Cleynen Isabelle, Moreau Yves
ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium.
Department of Medical Sciences, University of Torino, Torino, 10123 Italy.
NAR Genom Bioinform. 2020 Feb 21;2(1):lqaa011. doi: 10.1093/nargab/lqaa011. eCollection 2020 Mar.
Whole exome sequencing (WES) data are allowing researchers to pinpoint the causes of many Mendelian disorders. In time, sequencing data will be crucial to solving the puzzle of the genotype-to-phenotype relationship, but for the moment many conceptual and technical problems need to be addressed. In particular, very few attempts at the in-silico diagnosis of oligo-to-polygenic disorders have been made so far, due to the complexity of the challenge, the relative scarcity of the data and issues such as and data heterogeneity, which are confounding factors for machine learning (ML) methods. Here, we propose a method for the exome-based diagnosis of Crohn's disease (CD) patients which addresses many of the current methodological issues. First, we devise a rational ML-friendly feature representation for WES data based on the concept, which is suitable for datasets with small sample sizes. Second, we propose a Neural Network (NN) with and heavy regularization, in order to limit its complexity and thus the risk of over-fitting. We trained and tested our NN on three CD case-control datasets, comparing its performance with that of the participants in previous CAGI challenges. We show that, despite its limited complexity, the NN outperforms the previous approaches. Moreover, we interpret the NN predictions by analyzing the learned patterns at the variant and gene level and by investigating the decision process leading to each prediction.