A U-Statistic-based random Forest approach for genetic association study.

Author information

Li Ming, Peng Ruo-Sin, Wei Changshuai, Lu Qing

Affiliation

Department of Epidemiology, Michigan State University, East Lansing, MI 48824, USA.

Publication information

Front Biosci (Elite Ed). 2012 Jun 1;4(7):2607-2617. doi: 10.2741/e576.

DOI:10.2741/e576

PMID:22652671

Abstract

Variation in complex traits is influenced by multiple genetic variants, environmental risk factors, and their interactions. Although substantial progress has been made in identifying single genetic variants associated with complex traits, detecting gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, the search is usually limited to pairwise interactions, because the feature space and the computational cost grow exponentially with the order of interaction. As an alternative, recursive partitioning approaches such as random forests have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. In simulation studies, the Forest U-Test outperformed existing methods. The proposed method was also applied to the study of Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value of less than 0.001. The finding was replicated in two independent datasets, with p-values of 5.93e-19 and 4.70e-17, respectively.
