An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.

Affiliation

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.

Publication information

BMC Genet. 2010 Jun 14;11:49. doi: 10.1186/1471-2156-11-49.

DOI:10.1186/1471-2156-11-49

PMID:20546594

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2896336/

Abstract

BACKGROUND

As computational power improves, it becomes possible to apply more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets. While most traditional statistical methods can only elucidate main effects of genetic variants on disease risk, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for discovering SNPs related to human disease has grown in recent years; however, most work has been limited to small datasets or simulation studies.

RESULTS

Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared with findings from the original MS GWA study and show overlap with them. In addition, RF analysis identifies four new candidate MS genes of interest (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant follow-up in independent studies.
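As an illustration of the LD-pruning step mentioned above, the following is a minimal sketch of one common way to thin correlated SNPs before running RF. It is not the authors' procedure: the greedy strategy, the r² threshold of 0.8, the window of 50 SNPs, and the function names `r_squared` and `ld_prune` are all assumptions made for the example. Genotypes are assumed to be coded as minor-allele counts (0, 1, 2) and SNPs to be listed in chromosomal order.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as unlinked.
        return 0.0
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, threshold=0.8, window=50):
    """Greedy LD pruning (illustrative; threshold and window are assumed values).

    genotypes: list of SNPs in chromosomal order, each a list of per-sample
               minor-allele counts.
    Returns the indices of the SNPs that are retained.
    """
    kept = []
    for i, snp in enumerate(genotypes):
        # Compare only against retained SNPs at most `window` positions upstream.
        recent = [j for j in kept if i - j <= window]
        if all(r_squared(snp, genotypes[j]) < threshold for j in recent):
            kept.append(i)
    return kept


# Toy data: 4 SNPs genotyped in 6 samples (hypothetical values).
genotypes = [
    [0, 1, 2, 1, 0, 2],  # SNP 0
    [0, 1, 2, 1, 0, 2],  # SNP 1: identical to SNP 0 (r^2 = 1)
    [2, 1, 0, 1, 2, 0],  # SNP 2: allele coding flipped relative to SNP 0 (r^2 = 1)
    [0, 0, 1, 2, 1, 0],  # SNP 3: uncorrelated with SNP 0 (r^2 = 0)
]

print(ld_prune(genotypes))  # [0, 3]
```

Only the pruned set of SNPs would then be passed to the RF analysis. A pure-Python loop like this is meant to show the logic only; at the scale of 300 K SNPs a dedicated genetics tool or a vectorised implementation would be needed.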

CONCLUSIONS

This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It shows that RF is computationally feasible for GWA data and that the results obtained are biologically plausible in light of previous studies. More importantly, new genes were identified as potentially associated with MS, suggesting new avenues of investigation for this complex disease.
