Chattopadhyay Amrita, Lu Tzu-Pin
Institute of Epidemiology and Preventive Medicine, Department of Public Health, National Taiwan University, Taipei.
Ann Transl Med. 2019 Dec;7(24):813. doi: 10.21037/atm.2019.12.87.
Genetic variants identified through genome-wide association studies (GWAS) frequently show only modest effects on disease risk, leading to the "missing heritability" problem. One avenue to account for part of this "missingness" is to evaluate gene-gene interactions (epistasis) and thereby elucidate their effect on complex diseases. This can potentially help identify gene functions, pathways, and drug targets. However, exhaustively evaluating all possible genetic interactions among millions of single nucleotide polymorphisms (SNPs) raises several issues, otherwise known as the "curse of dimensionality". The dimensionality involved in the epistatic analysis of such an exponentially growing number of SNP combinations diminishes the usefulness of traditional, parametric statistical methods. The immense popularity of multifactor dimensionality reduction (MDR), a non-parametric method proposed in 2001 that collapses multi-dimensional genotypes into a one-dimensional binary attribute, led to the emergence of a fast-growing collection of MDR-based methods. Moreover, machine-learning (ML) methods such as random forests and neural networks (NNs), deep-learning (DL) approaches, and hybrid approaches have also been applied extensively in recent years to tackle the dimensionality issue associated with whole-genome gene-gene interaction studies. However, exhaustive searching in MDR-based approaches and variable selection in ML methods still pose the risk of missing relevant SNPs. Furthermore, interpretability issues are a major hindrance for DL methods. To minimize this loss of information, Python-based tools such as PySpark can take advantage of distributed computing resources in the cloud to bring back smaller subsets of data for further local analysis. Parallel computing can be a powerful resource in fighting this "curse". PySpark supports all standard Python libraries and C extensions, making it convenient to write code that delivers dramatic improvements in processing speed for extraordinarily large data sets.
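To make the MDR idea concrete, the following is a minimal sketch of its core step for a single pair of SNPs: each of the nine two-locus genotype cells is labeled high-risk when its case/control ratio exceeds a threshold (by default the overall case/control ratio), collapsing the two-dimensional genotype space into one binary attribute. This is an illustration under simplifying assumptions, not the original MDR software; the function name mdr_binary_attribute and the 0/1/2 genotype coding are hypothetical choices for the example.

```python
import numpy as np

def mdr_binary_attribute(geno_a, geno_b, is_case, threshold=None):
    """Collapse two SNPs (coded 0/1/2) into one binary high-risk/low-risk
    attribute, following the core MDR classification step (hypothetical
    sketch, not the canonical implementation)."""
    geno_a = np.asarray(geno_a)
    geno_b = np.asarray(geno_b)
    is_case = np.asarray(is_case, dtype=bool)

    # Default threshold: the overall case/control ratio in the sample.
    if threshold is None:
        threshold = is_case.sum() / max((~is_case).sum(), 1)

    high_risk = np.zeros(len(geno_a), dtype=int)
    # Examine each of the 3 x 3 two-locus genotype cells.
    for a in range(3):
        for b in range(3):
            cell = (geno_a == a) & (geno_b == b)
            cases = (cell & is_case).sum()
            controls = (cell & ~is_case).sum()
            # Label the cell high-risk (1) if its case/control ratio
            # exceeds the threshold; cells with no controls but some
            # cases are also treated as high-risk.
            if controls > 0 and cases / controls > threshold:
                high_risk[cell] = 1
            elif controls == 0 and cases > 0:
                high_risk[cell] = 1
    return high_risk

# Toy usage with simulated genotypes and case/control labels.
rng = np.random.default_rng(0)
g1 = rng.integers(0, 3, size=200)
g2 = rng.integers(0, 3, size=200)
y = rng.random(200) < 0.5
print(mdr_binary_attribute(g1, g2, y)[:10])
```

In a full MDR analysis this binary attribute would then be scored by cross-validated classification accuracy across all candidate SNP pairs, which is exactly where the combinatorial burden arises.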
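As an illustration of the PySpark strategy described above, the sketch below distributes an exhaustive pairwise interaction scan across a cluster and brings back only a small subset of top-ranked pairs for local follow-up. The simulated data, the chi-square scoring of the nine joint-genotype cells, and names such as score_pair are assumptions for this example, not a prescribed pipeline; it also assumes NumPy and SciPy are available on the executors.

```python
from itertools import combinations

import numpy as np
from pyspark.sql import SparkSession
from scipy.stats import chi2_contingency

spark = SparkSession.builder.appName("pairwise-epistasis-scan").getOrCreate()
sc = spark.sparkContext

# Toy genotype matrix (rows = samples, columns = SNPs, coded 0/1/2)
# and binary case/control labels; real data would be loaded from
# genotype files rather than simulated.
rng = np.random.default_rng(1)
n_samples, n_snps = 500, 50
genotypes = rng.integers(0, 3, size=(n_samples, n_snps))
labels = rng.integers(0, 2, size=n_samples)

# Broadcast the shared data to every executor once rather than
# shipping it with each task.
bc_geno = sc.broadcast(genotypes)
bc_labels = sc.broadcast(labels)

def score_pair(pair):
    """Chi-square test of the 9 two-locus genotype cells vs. case status."""
    i, j = pair
    g, y = bc_geno.value, bc_labels.value
    cell = g[:, i] * 3 + g[:, j]          # joint genotype code 0..8
    table = np.zeros((9, 2))
    for c, lab in zip(cell, y):
        table[c, lab] += 1
    table = table[table.sum(axis=1) > 0]  # drop empty genotype cells
    chi2, p, _, _ = chi2_contingency(table)
    return (p, (i, j))

# Distribute all C(n_snps, 2) candidate pairs across the cluster and
# return only the 10 most significant pairs for local inspection.
pairs = sc.parallelize(list(combinations(range(n_snps), 2)), numSlices=64)
for p, (i, j) in pairs.map(score_pair).takeOrdered(10, key=lambda t: t[0]):
    print(f"SNP{i} x SNP{j}: p = {p:.3g}")

spark.stop()
```

The design point the abstract makes maps directly onto this pattern: the heavy combinatorial work stays on the cluster, while only a small, ranked subset of results returns to the driver for further local analysis.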