Yang Haitao, Cao Hongyan, He Tao, Wang Tong, Cui Yuehua
Department of Epidemiology and Health Statistics, School of Public Health, and Hebei Province Key Laboratory of Environment and Human Health, Hebei Medical University, Shijiazhuang, PR China.
Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China.
Brief Bioinform. 2020 Jan 17;21(1):156-170. doi: 10.1093/bib/bby115.
High-throughput omics data are now generated with almost no limit. It is becoming increasingly important to integrate different omics data types to disentangle the molecular machinery of complex diseases, with the hope of better disease prevention and treatment. Because the relationships among different omics data features are typically unknown, a supervised learning model that assumes a particular distribution with a specific structure cannot capture the underlying complex relationship between multiple features and a disease phenotype. In this work, we briefly reviewed methods for kernel fusion (KF) based on support vector machine and kernel partial least squares (KPLS) algorithms. We then proposed a fused KPLS (fKPLS) model for disease classification and prediction with multilevel omics data. The fused kernel can deal with effect heterogeneity, in which different omics data types may contribute differently to the trait of interest, with the aim of improving prediction performance. We proposed to optimize the kernel parameters and kernel weights with the genetic algorithm (GA). Extensive simulations and real data analysis demonstrate that the proposed GA-fKPLS model can substantially improve disease classification performance by integrating multiple omics data types. With properly defined fitness functions during GA optimization, the proposed KF method can be extended to other kernel-based analyses, such as kernel association analysis with common or rare variants.
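The kernel fusion idea can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: each omics data type gets a Gaussian kernel, the fused kernel is a weighted sum with non-negative weights normalized to sum to one, and the bandwidths (`gammas`) and weights are fixed by hand rather than tuned by a GA. The function names, data block names, and all numeric values are hypothetical.

```python
import numpy as np


def gaussian_kernel(X, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-gamma * sq_dists)


def fused_kernel(blocks, gammas, weights):
    """Weighted sum of per-block kernels, K = sum_m w_m * K_m.

    Assumption: weights are non-negative and rescaled to sum to one,
    which keeps the fused matrix a valid (positive semidefinite) kernel.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    w = w / w.sum()
    return sum(wm * gaussian_kernel(X, g) for wm, X, g in zip(w, blocks, gammas))


# Hypothetical example: two simulated omics blocks measured on the same 20 samples.
rng = np.random.default_rng(0)
n_samples = 20
expression = rng.normal(size=(n_samples, 50))   # stand-in for expression data
methylation = rng.normal(size=(n_samples, 30))  # stand-in for methylation data

# Hand-picked bandwidths and weights; a GA or other optimizer would tune these.
K = fused_kernel([expression, methylation], gammas=[0.01, 0.02], weights=[0.7, 0.3])
```

The resulting matrix `K` could then be passed to any kernel method in place of a single-source kernel. Unequal weights are one simple way to express the effect heterogeneity mentioned above, since a block with a larger weight contributes more to the similarity between samples.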