A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility.

Author information

Moore Jason H, Gilbert Joshua C, Tsai Chia-Ti, Chiang Fu-Tien, Holden Todd, Barney Nate, White Bill C

Affiliation

Computational Genetics Laboratory, Department of Genetics, Dartmouth-Hitchcock Medical Center, One Medical Center Dr., 706 Rubin Bldg, HB7937, Lebanon, NH 03756, USA.

Publication information

J Theor Biol. 2006 Jul 21;241(2):252-61. doi: 10.1016/j.jtbi.2005.11.036. Epub 2006 Feb 2.

DOI:10.1016/j.jtbi.2005.11.036

PMID:16457852

Abstract

Detecting, characterizing, and interpreting gene-gene interactions or epistasis in studies of human disease susceptibility is both a mathematical and a computational challenge. To address this problem, we have previously developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension (i.e. constructive induction) thus permitting interactions to be detected in relatively small sample sizes. In this paper, we describe a comprehensive and flexible framework for detecting and interpreting gene-gene interactions that utilizes advances in information theory for selecting interesting single-nucleotide polymorphisms (SNPs), MDR for constructive induction, machine learning methods for classification, and finally graphical models for interpretation. We illustrate the usefulness of this strategy using artificial datasets simulated from several different two-locus and three-locus epistasis models. We show that the accuracy, sensitivity, specificity, and precision of a naïve Bayes classifier are significantly improved when SNPs are selected based on their information gain (i.e. class entropy removed) and reduced to a single attribute using MDR. We then apply this strategy to detecting, characterizing, and interpreting epistatic models in a genetic study (n = 500) of atrial fibrillation and show that both classification and model interpretation are significantly improved.
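Two of the steps named in the abstract, information gain for SNP selection and MDR for constructive induction, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`entropy`, `information_gain`, `mdr_attribute`), the 0/1 coding of genotypes and case status, and the default threshold (the overall case-to-control ratio) are assumptions made for illustration, and the toy data is a hand-built two-locus XOR pattern rather than anything from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(attribute, labels):
    """Class entropy removed by knowing the attribute: H(C) - H(C | A)."""
    n = len(labels)
    groups = defaultdict(list)
    for a, y in zip(attribute, labels):
        groups[a].append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def mdr_attribute(snp_a, snp_b, labels, threshold=None):
    """Collapse two SNPs into one binary attribute (1 = high risk, 0 = low risk).

    A genotype combination is labeled high risk when its case count exceeds
    `threshold` times its control count. The default threshold (overall
    case:control ratio) is an assumption of this sketch and requires at
    least one control in the data.
    """
    cases, controls = Counter(), Counter()
    for a, b, y in zip(snp_a, snp_b, labels):
        if y == 1:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1
    if threshold is None:
        threshold = sum(cases.values()) / sum(controls.values())
    cells = set(cases) | set(controls)
    # Multiplying instead of dividing avoids a zero division for empty control cells.
    high_risk = {cell for cell in cells if cases[cell] > threshold * controls[cell]}
    return [1 if (a, b) in high_risk else 0 for a, b in zip(snp_a, snp_b)]


# Toy XOR-style epistasis: neither SNP is informative alone, the pair is.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]  # 1 = case, 0 = control

combined = mdr_attribute(snp_a, snp_b, status)
gain_a = information_gain(snp_a, status)
gain_b = information_gain(snp_b, status)
gain_combined = information_gain(combined, status)
```

In this toy pattern each SNP alone removes no class entropy, while the single MDR-constructed attribute removes all of it. That is the kind of purely interactive signal a classifier such as naïve Bayes cannot pick up from the individual SNPs but can use once they are reduced to one attribute.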
