Application of two machine learning algorithms to genetic association studies in the presence of covariates.

Authors

Nonyane Bareng A S, Foulkes Andrea S

Affiliation

Division of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, MA, USA.

Publication

BMC Genet. 2008 Nov 14;9:71. doi: 10.1186/1471-2156-9-71.

DOI:10.1186/1471-2156-9-71
PMID:19014573
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2620353/

Abstract

BACKGROUND

Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized.

METHODS AND RESULTS

In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided.

CONCLUSION

Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
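
To make the confounding concern in the conclusion concrete, the following is a minimal sketch and not the authors' simulation design. It assumes a hypothetical binary covariate that shifts both the chance of carrying a variant and the trait value, while the variant itself has no effect on the trait. It uses plain mean differences rather than RF or MARS, and every name, probability, and effect size in it is an illustrative assumption.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Simulate (covariate, genotype, trait) rows under pure confounding.

    Assumed setup (hypothetical): the covariate changes the carrier
    probability and shifts the trait; the genotype has NO direct effect.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5                    # binary covariate
        g = rng.random() < (0.7 if c else 0.2)    # carrier status depends on c
        y = 2.0 * c + rng.gauss(0.0, 1.0)         # trait depends on c only
        rows.append((c, g, y))
    return rows


def mean_diff(rows):
    """Mean trait in carriers minus mean trait in non-carriers."""
    carriers = [y for _, g, y in rows if g]
    non_carriers = [y for _, g, y in rows if not g]
    return mean(carriers) - mean(non_carriers)


rows = simulate()

# Crude comparison ignores the covariate and picks up its effect.
crude = mean_diff(rows)

# Stratified comparison: compute the difference within each covariate
# level, then average the two strata.
adjusted = mean(
    mean_diff([r for r in rows if r[0] == level]) for level in (False, True)
)

print(f"crude difference:    {crude:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

Under these assumptions the crude difference is clearly non-zero even though the variant does nothing, while the stratified difference is close to zero. A variable-selection algorithm that is given the genotype without the covariate can be misled in the same way.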

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8e6/2620353/64b3ce6b9759/1471-2156-9-71-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8e6/2620353/631866448c48/1471-2156-9-71-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8e6/2620353/6f8feeff5eca/1471-2156-9-71-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8e6/2620353/c23179c742f6/1471-2156-9-71-3.jpg
