Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias.

Author information

Winham Stacey J, Jenkins Gregory D, Biernacka Joanna M

Affiliations

Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America.

Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota, United States of America.

Publication information

Genet Epidemiol. 2016 Feb;40(2):123-32. doi: 10.1002/gepi.21946. Epub 2015 Dec 7.

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
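
The abstract names two extensions, a stratified approach and one based on X chromosome inactivation, without giving their details. The sketch below illustrates only the data-preparation ideas that such approaches commonly rely on, and it is not the snpRF implementation. It assumes, for illustration, that X SNP genotypes are stored as minor-allele counts (0/1 for males, 0/1/2 for females), that an inactivation-style coding maps a male carrying the allele to 2 so that he is treated like a homozygous female, and that stratification means splitting samples by sex. The function names and the "M"/"F" labels are hypothetical.

```python
from typing import Dict, List, Sequence

MALE, FEMALE = "M", "F"  # hypothetical sex labels, not taken from snpRF


def recode_x_snp(minor_allele_counts: Sequence[int], sexes: Sequence[str]) -> List[int]:
    """Recode X SNP minor-allele counts under an assumed X-inactivation coding.

    Females keep their 0/1/2 counts. Males, who carry a single X copy,
    are mapped from 0/1 to 0/2 (an illustrative assumption).
    """
    if len(minor_allele_counts) != len(sexes):
        raise ValueError("genotype and sex vectors must have equal length")
    recoded = []
    for count, sex in zip(minor_allele_counts, sexes):
        if sex == MALE and count in (0, 1):
            recoded.append(2 * count)
        elif sex == FEMALE and count in (0, 1, 2):
            recoded.append(count)
        else:
            raise ValueError(f"unexpected genotype {count!r} for sex {sex!r}")
    return recoded


def split_indices_by_sex(sexes: Sequence[str]) -> Dict[str, List[int]]:
    """Return sample indices grouped by sex, as input to a sex-stratified analysis."""
    strata: Dict[str, List[int]] = {MALE: [], FEMALE: []}
    for i, sex in enumerate(sexes):
        strata[sex].append(i)
    return strata


if __name__ == "__main__":
    sexes = ["M", "M", "F", "F", "F"]
    genotypes = [0, 1, 0, 1, 2]
    print(recode_x_snp(genotypes, sexes))
    print(split_indices_by_sex(sexes))
```

Under the raw 0/1 versus 0/1/2 coding, the genotype value itself carries information about sex, which is one way a sex-associated trait could inflate the apparent importance of X SNPs. Either recoding or stratifying removes that shortcut in this toy setting; how snpRF handles it should be checked against the package documentation.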

Similar articles

1. Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias. Genet Epidemiol. 2016 Feb;40(2):123-32. doi: 10.1002/gepi.21946. Epub 2015 Dec 7.
2. Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests. BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.
3. Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction. Artif Intell Med. 2018 Apr;85:43-49. doi: 10.1016/j.artmed.2017.09.005. Epub 2017 Sep 22.
4. Testing and estimation of X-chromosome SNP effects: Impact of model assumptions. Genet Epidemiol. 2021 Sep;45(6):577-592. doi: 10.1002/gepi.22393. Epub 2021 Jun 3.
5. Detecting associated single-nucleotide polymorphisms on the X chromosome in case control genome-wide association studies. Stat Methods Med Res. 2017 Apr;26(2):567-582. doi: 10.1177/0962280214551815. Epub 2014 Sep 24.
6. 2LD, GENECOUNTING and HAP: Computer programs for linkage disequilibrium analysis. Bioinformatics. 2004 May 22;20(8):1325-6. doi: 10.1093/bioinformatics/bth071. Epub 2004 Feb 10.
7. Association tests for X-chromosomal markers--a comparison of different test statistics. Hum Hered. 2011;71(1):23-36. doi: 10.1159/000323768. Epub 2011 Feb 16.
8. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005 Feb;28(2):171-82. doi: 10.1002/gepi.20041.

Cited by

1. Intersections of machine learning and epidemiological methods for health services research. Int J Epidemiol. 2021 Jan 23;49(6):1763-1770. doi: 10.1093/ije/dyaa035.
2. Statistical learning approaches in the genetic epidemiology of complex diseases. Hum Genet. 2020 Jan;139(1):73-84. doi: 10.1007/s00439-019-01996-9. Epub 2019 May 2.
3. Viewing the male-specific chromosome Y in a new light. Eur J Hum Genet. 2017 Nov;25(11):1177-1178. doi: 10.1038/ejhg.2017.135. Epub 2017 Aug 30.
