Maj Carlo, Azevedo Tiago, Giansanti Valentina, Borisov Oleg, Dimitri Giovanna Maria, Spasov Simeon, Lió Pietro, Merelli Ivan
Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Bonn, Germany.
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom.
Front Genet. 2019 Sep 3;10:726. doi: 10.3389/fgene.2019.00726. eCollection 2019.
The genetic component of many common traits is associated with gene expression, and several variants act as expression quantitative trait loci (eQTLs), regulating gene expression in a tissue-specific manner. In this work, we applied tissue-specific cis-eQTL gene expression prediction models to the genotypes of 808 samples, including controls, subjects with mild cognitive impairment, and patients with Alzheimer's disease. We then dissected the imputed transcriptomic profiles using different unsupervised and supervised machine learning approaches to identify potential biological associations. Our analysis suggests that unsupervised and supervised methods provide complementary information, which can be integrated for a better characterization of the underlying biological system. In particular, a variational autoencoder representation of the transcriptomic profiles, followed by support vector machine classification, was used for tissue-specific gene prioritization. Interestingly, the resulting gene prioritizations can be efficiently integrated as a feature selection step to improve the accuracy of deep learning classifier networks. The identified gene-tissue information suggests a potential role for inflammatory and regulatory processes in tissues related to the gut-brain axis. In line with the expected low heritability that can be apportioned to eQTL variants, we achieved only relatively low prediction capability with deep learning classification models. However, our analysis revealed that the classification power strongly depends on the network structure, with recurrent neural networks being the best-performing network class. Interestingly, cross-tissue analysis suggests a potentially greater role for models trained in brain tissues, including when dementia-related endophenotypes are considered.
Overall, the present analysis suggests that the combination of supervised and unsupervised machine learning techniques can be used for the evaluation of high-dimensional omics data.
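The idea of using a classifier-derived gene prioritization as a feature selection step can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' pipeline: the data are synthetic, the variational autoencoder stage is omitted, the support vector machine is a plain linear SVM fitted by subgradient descent, and genes are ranked by the absolute value of their weights. The function names `train_linear_svm` and `prioritize_genes` are hypothetical.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by full-batch subgradient descent on the
    L2-regularised hinge loss. Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # samples that violate the margin
        grad_w = lam * w - (X[active] * y[active, None]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def prioritize_genes(w, top_k):
    """Return indices of the top_k features ranked by |weight| (assumed ranking rule)."""
    return np.argsort(-np.abs(w))[:top_k]


# Synthetic stand-in for imputed expression: 200 samples x 50 "genes",
# of which three (hypothetical indices) carry a class signal.
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
informative = np.array([3, 17, 42])
y = rng.choice([-1, 1], size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, informative] += 1.5 * y[:, None]

w, b = train_linear_svm(X, y)
selected = prioritize_genes(w, top_k=5)

# The reduced matrix is what a downstream classifier network would receive.
X_selected = X[:, selected]
```

In a setting closer to the one described, the input matrix would hold tissue-specific imputed expression values (or a latent representation of them), and `X_selected` would be passed to the deep learning classifier in place of the full feature set.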