An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.

Author information

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.

Publication information

BMC Genet. 2010 Jun 14;11:49. doi: 10.1186/1471-2156-11-49.

Abstract

BACKGROUND

As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies, both of which are limited in scope.

RESULTS

Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, RF analysis identifies four interesting new candidate MS genes (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant further follow-up in independent studies.
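
The abstract does not say how the LD-based pruning was carried out. As a rough illustration only, the general idea can be sketched as a greedy filter that drops a SNP when it is highly correlated with a SNP already kept. The function names (`r_squared`, `ld_prune`), the r² threshold of 0.8, the window of 50 kept SNPs, and the toy genotypes are all assumptions made for this sketch, not values from the study.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated here as no LD
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, threshold=0.8, window=50):
    """Greedy LD pruning (hypothetical helper, not the authors' procedure).

    genotypes: list of per-SNP genotype lists, in chromosomal order.
    A SNP is kept only if its r^2 with each of the last `window` kept SNPs
    is below `threshold`. Returns the indices of the kept SNPs.
    """
    kept = []
    for idx, snp in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(snp, genotypes[k]) < threshold for k in recent):
            kept.append(idx)
    return kept


# Toy data: SNP 1 duplicates SNP 0 (r^2 = 1); SNP 2 is uncorrelated with SNP 0.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
]
kept = ld_prune(snps, threshold=0.8)
print(kept)
```

In this sketch the window counts kept SNPs rather than base-pair distance, which keeps the example short; a real analysis would more likely rely on an established genetics toolkit and a physical or genetic distance window.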

CONCLUSIONS

This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It shows that RF is computationally feasible for GWA data and that the results obtained are biologically plausible in light of previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
