Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, 55131 Mainz, Germany.
Department of Internal Medicine III, University Hospital of Ulm, 89081 Ulm, Germany.
Bioinformatics. 2017 Oct 15;33(20):3173-3180. doi: 10.1093/bioinformatics/btx408.
Learning the joint distributions of measurements, and in particular identification of an appropriate low-dimensional manifold, has been found to be a powerful ingredient of deep learning approaches. Yet, such approaches have hardly been applied to single nucleotide polymorphism (SNP) data, probably because the number of features typically exceeds the number of studied individuals.
After a brief overview of how deep Boltzmann machines (DBMs), a deep learning approach, can be adapted to SNP data in principle, we specifically present a way to alleviate the dimensionality problem by partitioned learning. We propose a sparse regression approach to coarsely screen the joint distribution of SNPs, followed by training several DBMs on SNP partitions that were identified by the screening. Aggregate features representing SNP patterns and the corresponding SNPs are extracted from the DBMs by a combination of statistical tests and sparse regression. In simulated case-control data, we show how this can uncover complex SNP patterns and augment results from univariate approaches, while maintaining type 1 error control. Time-to-event endpoints are considered in an application with acute myeloid leukemia patients, where SNP patterns are modeled after a pre-screening based on gene expression data. The proposed approach identified three SNPs that seem to jointly influence survival in a validation dataset. This indicates the added value of jointly investigating SNPs compared to standard univariate analyses and makes partitioned learning of DBMs an interesting complementary approach when analyzing SNP data.
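To make the partitioning idea concrete, the following is a minimal sketch and not the authors' implementation. It replaces the sparse regression screening described above with a much simpler stand-in, thresholding absolute pairwise Pearson correlations, and then groups SNPs into connected components. In the approach described above, each resulting partition would then receive its own DBM. The function names `correlation` and `partition_snps`, the `threshold` parameter, and the toy genotype matrix are all hypothetical and chosen for illustration only.

```python
from itertools import combinations
from statistics import mean, pstdev


def correlation(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def partition_snps(genotypes, threshold=0.5):
    """Group SNP columns linked by absolute pairwise correlation above `threshold`.

    genotypes: list of rows (individuals), each a list of minor-allele counts 0/1/2.
    Returns a sorted list of lists of column indices (connected components).
    NOTE: correlation thresholding is an assumed stand-in for sparse regression.
    """
    n_snps = len(genotypes[0])
    columns = [[row[j] for row in genotypes] for j in range(n_snps)]
    parent = list(range(n_snps))  # union-find forest over SNP indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n_snps), 2):
        if abs(correlation(columns[i], columns[j])) > threshold:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for j in range(n_snps):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


# Toy data (hypothetical): SNPs 0 and 1 vary together, as do SNPs 2 and 3.
toy = [
    [0, 0, 2, 2],
    [1, 1, 0, 0],
    [2, 2, 1, 1],
    [0, 0, 0, 0],
    [2, 2, 2, 2],
    [1, 1, 1, 1],
]
partitions = partition_snps(toy)
print(partitions)
```

With the default threshold, the toy data splits into two partitions of two SNPs each. Lowering the threshold merges them, which illustrates the trade-off between partition size and the number of separate models to train.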
A Julia package is provided at 'http://github.com/binderh/BoltzmannMachines.jl'.
Supplementary data are available at Bioinformatics online.