Maximal conditional chi-square importance in random forests.

Affiliation

Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, USA.

Publication information

Bioinformatics. 2010 Mar 15;26(6):831-7. doi: 10.1093/bioinformatics/btq038. Epub 2010 Feb 3.

DOI:10.1093/bioinformatics/btq038

PMID:20130032

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2832825/

Abstract

MOTIVATION

High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features, such as single nucleotide polymorphisms (SNPs) in GWAS, that are associated with a disease. Random forests are a very useful approach for this purpose because they rank features by a variable importance score. However, this importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.

RESULTS

We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses the maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared with a univariate test and with permutation tests based on the Gini and permutation importance measures. In simulations, the proposed method consistently outperformed the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed to a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs because it uses conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates understanding of the etiological relationship between genetic variants and complex diseases.
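The two ideas in this paragraph, a chi-square association measured conditionally on other SNPs and a permutation test that turns the resulting importance into empirical P-values, can be illustrated with a small sketch. This is not the authors' implementation: in a random forest the conditioning comes from the splits that lead to a node, whereas the sketch below conditions on the genotype of one other SNP at a time as a simplified stand-in. The function names, the `min_size` threshold and the genotype coding (0/1/2) are all assumptions made for illustration.

```python
import numpy as np


def chi2_stat(x, y):
    """Pearson chi-square statistic for the contingency table of x versus y."""
    x_levels, x_idx = np.unique(x, return_inverse=True)
    y_levels, y_idx = np.unique(y, return_inverse=True)
    if len(x_levels) < 2 or len(y_levels) < 2:
        return 0.0  # no variation, so no measurable association
    table = np.zeros((len(x_levels), len(y_levels)))
    np.add.at(table, (x_idx, y_idx), 1)  # count each (genotype, trait) pair
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def max_conditional_chi2(G, y, j, min_size=20):
    """Largest chi-square of SNP j versus the trait, taken over the full sample
    and over strata defined by the genotype of each other SNP.

    Simplified stand-in (assumption): a forest would condition on tree paths.
    """
    best = chi2_stat(G[:, j], y)
    for k in range(G.shape[1]):
        if k == j:
            continue
        for genotype in np.unique(G[:, k]):
            mask = G[:, k] == genotype
            if mask.sum() >= min_size:  # skip strata too small to be informative
                best = max(best, chi2_stat(G[mask, j], y[mask]))
    return best


def permutation_pvalues(G, y, n_perm=200, seed=0):
    """Empirical per-SNP P-values: permute the trait, recompute the importance,
    and count how often the permuted value reaches the observed one."""
    rng = np.random.default_rng(seed)
    n_snps = G.shape[1]
    observed = np.array([max_conditional_chi2(G, y, j) for j in range(n_snps)])
    exceed = np.zeros(n_snps)
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # breaks any SNP-trait association
        null = np.array([max_conditional_chi2(G, y_perm, j) for j in range(n_snps)])
        exceed += null >= observed
    # Add-one correction keeps the estimate strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Calling `permutation_pvalues(G, y)` with an `n_samples x n_snps` integer genotype matrix `G` and a binary trait vector `y` returns the observed importances and one empirical P-value per SNP. These are unadjusted per-SNP values; a genome-wide adjusted level such as the one quoted above would need an additional multiple-testing step that this sketch does not attempt.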

CONTACT

heping.zhang@yale.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
