Department of Clinical Epidemiology, Biostatistics and Bioinformatics.
Department of Medical Informatics, Academic Medical Center, Amsterdam 1105 AZ, The Netherlands.
Bioinformatics. 2017 Oct 15;33(20):3228-3234. doi: 10.1093/bioinformatics/btx374.
Recent technological developments have made integrated genetic and genomic data analysis possible: multiple omics datasets from various biological levels are combined and used to describe (disease) phenotypic variation. The main goal is to explain, and ultimately predict, phenotypic variation by understanding its genetic basis and the interactions among the associated genetic factors. Understanding the genetic mechanisms underlying phenotypic variation is therefore of ever-increasing interest in the biomedical sciences. In many situations, one set of variables can be regarded as outcome variables and another set as explanatory variables. Redundancy analysis (RDA) is an analytic method designed for this type of directionality. Unfortunately, current implementations of RDA cannot deal optimally with the high dimensionality of omics data (p≫n). The existing theoretical framework, based on Ridge penalization, is suboptimal because it retains all variables in the analysis. As a solution, we propose using Elastic Net penalization in an iterative RDA framework to obtain a sparse solution.
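The contrast between Ridge and Elastic Net penalization can be illustrated with a minimal sketch. It is not the sRDA implementation. It assumes the simplest possible setting, an orthonormal design, in which the penalized objective ½‖y − Xβ‖² + λ₁‖β‖₁ + (λ₂/2)‖β‖² has a closed-form solution for each coefficient. The function names, the coefficient values, and the penalty values `lam1` and `lam2` are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; any |z| <= t becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_coef(b_ols, lam2):
    """Ridge under an orthonormal design: every coefficient is rescaled, none is removed."""
    return [b / (1.0 + lam2) for b in b_ols]


def elastic_net_coef(b_ols, lam1, lam2):
    """Elastic Net under an orthonormal design: soft-threshold (L1 part), then rescale (L2 part)."""
    return [soft_threshold(b, lam1) / (1.0 + lam2) for b in b_ols]


# Hypothetical unpenalized coefficients: two strong signals and three weak ones.
b_ols = [2.0, -0.3, 0.1, -1.5, 0.05]

ridge = ridge_coef(b_ols, lam2=1.0)
enet = elastic_net_coef(b_ols, lam1=0.5, lam2=1.0)

# Count how many variables each penalty keeps in the model.
n_ridge = sum(1 for b in ridge if b != 0.0)
n_enet = sum(1 for b in enet if b != 0.0)

print("ridge:", ridge, "nonzero:", n_ridge)
print("elastic net:", enet, "nonzero:", n_enet)
```

In this toy setting Ridge keeps all five coefficients, whereas the L1 part of the Elastic Net sets the three weak ones exactly to zero. That is the sense in which the Elastic Net yields a sparse solution. With correlated, high-dimensional predictors no such closed form exists, and an iterative solver is needed instead.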
We propose sparse redundancy analysis (sRDA) for the analysis of high-dimensional omics data. We conducted simulation studies with our software implementation of sRDA to assess its reliability. Both the analysis of simulated data and the analysis of 485,512 methylation markers and 18,424 gene-expression values measured in 55 patients with Marfan syndrome show that sRDA can handle the high dimensionality typical of omics data.
Supplementary data are available at Bioinformatics online.