Picking single-nucleotide polymorphisms in forests.

Authors

Schwarz Daniel F, Szymczak Silke, Ziegler Andreas, König Inke R

Affiliation

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany.

Publication information

BMC Proc. 2007;1 Suppl 1(Suppl 1):S59. doi: 10.1186/1753-6561-1-s1-s59. Epub 2007 Dec 18.

DOI:10.1186/1753-6561-1-s1-s59

PMID:18466559

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2367487/

Abstract

With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs in smaller samples poses a challenge to the application of classical statistical procedures. A possible solution is to use a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model.

The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg.

The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk and all covariates without any false positive.

We showed that a two-stage approach with a screening of SNPs by RFs is suitable to detect candidate SNPs in genome-wide association studies for complex diseases.
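The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the data are a toy simulation (not the Genetic Analysis Workshop 15 data), the "forest" is built from depth-1 trees (stumps) rather than fully grown trees, haplotype tagging is omitted, stage two fits one univariate logistic regression per selected SNP, and Bonferroni is assumed as the multiple-testing correction because the abstract does not name one. All function names and parameter values (`make_data`, `fit_stump`, `mtry=5`, and so on) are hypothetical.

```python
import math
import random

def make_data(n, n_snps, causal, rng):
    """Toy case-control data: genotypes coded 0/1/2, one hypothetical causal SNP."""
    X, y = [], []
    for _ in range(n):
        row = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        risk = 0.1 + 0.4 * row[causal]  # assumed disease model: 0.1, 0.5, 0.9
        X.append(row)
        y.append(1 if rng.random() < risk else 0)
    return X, y

def majority(labels):
    return 1 if 2 * sum(labels) >= len(labels) else 0

def fit_stump(X, y, idx, features):
    """Depth-1 tree: best (feature, threshold) split by training misclassification."""
    best = None
    for j in features:
        for t in (0.5, 1.5):
            left = [y[i] for i in idx if X[i][j] < t]
            right = [y[i] for i in idx if X[i][j] >= t]
            pl, pr = majority(left), majority(right)
            err = sum(v != pl for v in left) + sum(v != pr for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, pl, pr)
    return best[1:]  # (feature, threshold, left label, right label)

def classify(stump, value):
    _, t, pl, pr = stump
    return pl if value < t else pr

def permutation_importance(X, y, n_trees, mtry, rng):
    """Mean drop in out-of-bag accuracy after permuting each variable."""
    n, p = len(X), len(X[0])
    imp = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        stump = fit_stump(X, y, boot, rng.sample(range(p), mtry))
        j = stump[0]
        # A stump uses a single variable, so permuting any other variable
        # leaves its predictions unchanged (contribution 0 for this tree).
        values = [X[i][j] for i in oob]
        acc = sum(classify(stump, v) == y[i] for i, v in zip(oob, values))
        rng.shuffle(values)
        acc_perm = sum(classify(stump, v) == y[i] for i, v in zip(oob, values))
        imp[j] += (acc - acc_perm) / len(oob)
    return [v / n_trees for v in imp]

def logistic_fit(x, y, iters=25):
    """Univariate logistic regression (intercept + slope) by Newton-Raphson.
    Returns the slope and its approximate Wald standard error."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, math.sqrt(h00 / det)

rng = random.Random(15)
X, y = make_data(n=600, n_snps=30, causal=7, rng=rng)
half = len(X) // 2  # rows are generated independently, so a simple cut is a random split
X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]

# Stage 1: rank SNPs by permutation importance on the first half, keep the top 5.
imp = permutation_importance(X1, y1, n_trees=300, mtry=5, rng=rng)
top = sorted(range(len(imp)), key=lambda j: imp[j], reverse=True)[:5]

# Stage 2: logistic regression on the second half, Bonferroni-corrected Wald test.
results = {}
for j in top:
    beta, se = logistic_fit([row[j] for row in X2], y2)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    results[j] = (beta, min(1.0, p_value * len(top)))
significant = [j for j in top if results[j][1] < 0.05]
print("selected:", top, "significant after correction:", significant)
```

Because the screening and the effect estimation use disjoint halves of the data, the stage-two p-values are not biased by the stage-one selection, which is the main motivation for splitting the sample.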

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0978/2367487/96415fd5bfae/1753-6561-1-S1-S59-1.jpg

Similar articles

Picking single-nucleotide polymorphisms in forests.

BMC Proc. 2007;1 Suppl 1(Suppl 1):S59. doi: 10.1186/1753-6561-1-s1-s59. Epub 2007 Dec 18.

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.

Two-stage approach for identifying single-nucleotide polymorphisms associated with rheumatoid arthritis using random forests and Bayesian networks.

BMC Proc. 2007;1 Suppl 1(Suppl 1):S56. doi: 10.1186/1753-6561-1-s1-s56. Epub 2007 Dec 18.

The impact of missing and erroneous genotypes on tagging SNP selection and power of subsequent association tests.

Hum Hered. 2006;61(1):31-44. doi: 10.1159/000092141. Epub 2006 Mar 23.

Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data.

BMC Med Inform Decis Mak. 2013;13 Suppl 1(Suppl 1):S3. doi: 10.1186/1472-6947-13-S1-S3. Epub 2013 Apr 5.

Screening large-scale association study data: exploiting interactions using random forests.

BMC Genet. 2004 Dec 10;5:32. doi: 10.1186/1471-2156-5-32.

An omnibus permutation test on ensembles of two-locus analyses can detect pure epistasis and genetic heterogeneity in genome-wide association studies.

Springerplus. 2013 May 19;2:230. doi: 10.1186/2193-1801-2-230. eCollection 2013.

Prioritize and select SNPs for association studies with multi-stage designs.

J Comput Biol. 2008 Apr;15(3):241-57. doi: 10.1089/cmb.2007.0090.

Mining gold dust under the genome wide significance level: a two-stage approach to analysis of GWAS.

Genet Epidemiol. 2011 Feb;35(2):111-8. doi: 10.1002/gepi.20556. Epub 2010 Dec 31.

Comparison of tagging single-nucleotide polymorphism methods in association analyses.

BMC Proc. 2007;1 Suppl 1(Suppl 1):S6. doi: 10.1186/1753-6561-1-s1-s6. Epub 2007 Dec 18.

Cited by

Combining Random Forests and a Signal Detection Method Leads to the Robust Detection of Genotype-Phenotype Associations.

Genes (Basel). 2020 Aug 5;11(8):892. doi: 10.3390/genes11080892.

Ensemble learning for detecting gene-gene interactions in colorectal cancer.

PeerJ. 2018 Oct 29;6:e5854. doi: 10.7717/peerj.5854. eCollection 2018.

Evaluation of potential novel variations and their interactions related to bipolar disorders: analysis of genome-wide association study data.

Neuropsychiatr Dis Treat. 2016 Nov 24;12:2997-3004. doi: 10.2147/NDT.S112558. eCollection 2016.

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.

Parallel classification and feature selection in microarray data using SPRINT.

Concurr Comput. 2014 Mar 25;26(4):854-865. doi: 10.1002/cpe.2928.

Random forest fishing: a novel approach to identifying organic group of risk factors in genome-wide association studies.

Eur J Hum Genet. 2014 Feb;22(2):254-9. doi: 10.1038/ejhg.2013.109. Epub 2013 May 22.

Random forests for genetic association studies.

Stat Appl Genet Mol Biol. 2011;10(1):32. doi: 10.2202/1544-6115.1691. Epub 2011 Jul 12.

SNP interaction detection with Random Forests in high-dimensional genetic data.

BMC Bioinformatics. 2012 Jul 15;13:164. doi: 10.1186/1471-2105-13-164.

On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data.

Bioinformatics. 2010 Jul 15;26(14):1752-8. doi: 10.1093/bioinformatics/btq257. Epub 2010 May 26.

Selection of important variables by statistical learning in genome-wide association analysis.

BMC Proc. 2009 Dec 15;3 Suppl 7(Suppl 7):S70. doi: 10.1186/1753-6561-3-s7-s70.

References

Bias in random forest variable importance measures: illustrations, sources and a solution.

BMC Bioinformatics. 2007 Jan 25;8:25. doi: 10.1186/1471-2105-8-25.

The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases.

BMC Genet. 2006 Apr 21;7:23. doi: 10.1186/1471-2156-7-23.

Gene selection and classification of microarray data using random forest.

BMC Bioinformatics. 2006 Jan 6;7:3. doi: 10.1186/1471-2105-7-3.

Screening large-scale association study data: exploiting interactions using random forests.

BMC Genet. 2004 Dec 10;5:32. doi: 10.1186/1471-2156-5-32.

Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power.

Hum Hered. 2003;56(1-3):18-31. doi: 10.1159/000073729.
