Multigenic modeling of complex disease by random forests.

Author information

Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA.

Publication information

Adv Genet. 2010;72:73-99. doi: 10.1016/B978-0-12-380862-2.00004-7.

Abstract

The genetics and heredity of complex human traits have been studied for over a century, and many genes have been implicated in these traits. Genome-wide association studies (GWAS) were designed to investigate the association between common genetic variation and complex human traits using high-throughput platforms that measure hundreds of thousands of common single-nucleotide polymorphisms (SNPs). Using a univariate regression-based approach, GWAS have successfully identified many novel genetic loci associated with complex traits. Yet even for traits with a large number of identified variants, only a small fraction of the interindividual variation in risk phenotypes has been explained. In biological systems, proteins, DNA, RNA, and metabolites frequently interact with one another to perform their biological functions and to respond to environmental factors. The complex interactions among genes, and between genes and the environment, may partially explain the "missing heritability." Traditional regression-based methods are limited in their ability to address the complex interactions among hundreds of thousands of SNPs and their environmental context, because of both modeling and computational challenges. Random Forests (RF), a powerful machine learning method, is regarded as a useful alternative for capturing complex interaction effects in GWAS data and for potentially addressing the genetic heterogeneity underlying these complex traits within a computationally efficient framework. In this chapter, the prediction and variable selection features of RF, and their applications in genetic association studies, are reviewed and discussed. Additional improvements to the original RF method are warranted to make its application to GWAS more successful.
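
As a rough illustration of the two RF features mentioned above (prediction and variable selection), the following minimal sketch fits a random forest to simulated genotype data. It is not taken from the chapter: the simulation design (30 SNPs, a minor allele frequency of 0.3, and elevated risk only for carriers at both of the first two SNPs), the helper names `simulate_genotypes` and `fit_forest`, and the use of scikit-learn's `RandomForestClassifier` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def simulate_genotypes(n_samples=1000, n_snps=30, maf=0.3, seed=0):
    """Simulate additive-coded genotypes (0/1/2) and a binary phenotype.

    Hypothetical design: risk is raised only when an individual carries
    at least one minor allele at BOTH SNP 0 and SNP 1 (an interaction).
    """
    rng = np.random.default_rng(seed)
    X = rng.binomial(2, maf, size=(n_samples, n_snps))
    carrier_both = (X[:, 0] > 0) & (X[:, 1] > 0)
    risk = np.where(carrier_both, 0.8, 0.2)  # assumed penetrances
    y = (rng.random(n_samples) < risk).astype(int)
    return X, y


def fit_forest(X, y, seed=0):
    """Fit a forest and keep the out-of-bag (OOB) accuracy estimate."""
    forest = RandomForestClassifier(
        n_estimators=300,
        max_features="sqrt",  # number of SNPs tried at each split
        oob_score=True,       # prediction accuracy on held-out bootstrap samples
        random_state=seed,
    )
    forest.fit(X, y)
    return forest


X, y = simulate_genotypes()
forest = fit_forest(X, y)

# Variable selection: rank SNPs by impurity-based importance, highest first.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("OOB accuracy:", round(forest.oob_score_, 3))
print("Top 5 SNP indices by importance:", ranking[:5].tolist())
```

In a sketch like this, the OOB accuracy stands in for the prediction feature and the importance ranking for variable selection. Real GWAS data involve far more SNPs than samples, which is one reason the abstract calls for further improvements to the original method.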
