A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions.

Author information

Orlenko Alena, Moore Jason H

Affiliation

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.

Publication information

BioData Min. 2021 Jan 29;14(1):9. doi: 10.1186/s13040-021-00243-0.

DOI:10.1186/s13040-021-00243-0

PMID:33514397

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7847145/

Abstract

BACKGROUND

Non-additive interactions among genes are frequently associated with a range of phenotypes, including complex diseases such as Alzheimer's disease, diabetes, and cardiovascular disease. Detecting these interactions requires careful selection of analytical methods, because some machine learning algorithms are unable, or underpowered, to detect or model non-additive feature interactions. The Random Forest method is often used for this purpose because it can detect and model non-additive interactions. Random Forest also has a built-in ability to estimate feature importance scores, which allows the model to be interpreted in terms of the rank order and effect size of each feature's association with the outcome. This is especially important in epidemiological and clinical studies, where the results of predictive modeling may guide the direction of future research. The model can alternatively be interpreted with the permutation feature importance metric, which permutes a feature's values and reports that feature's contribution as the resulting decrease in model performance, or with Shapley additive explanations, which take a cooperative game theory approach. It is currently unclear which Random Forest feature importance metric best estimates the true informative contribution of features in genetic association analysis.
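As a rough illustration of the cooperative game theory idea behind Shapley additive explanations, the sketch below computes exact Shapley values for a single prediction. Everything in it is an assumption made for illustration and not taken from the paper: the toy `xor_model`, the three binary features, and the choice to average absent features uniformly over {0, 1}. It is not the SHAP library or its tree-specific algorithm.

```python
from itertools import combinations, product
from math import factorial

def xor_model(row):
    """Hypothetical model: a purely non-additive (XOR) effect of features 0 and 1."""
    return row[0] ^ row[1]

def coalition_value(model, x, coalition, n):
    """Mean model output with features in `coalition` fixed to their values in x
    and every other feature averaged uniformly over {0, 1} (an assumption)."""
    free = [j for j in range(n) if j not in coalition]
    total = 0
    for values in product([0, 1], repeat=len(free)):
        row = list(x)
        for j, v in zip(free, values):
            row[j] = v
        total += model(row)
    return total / 2 ** len(free)

def shapley_values(model, x):
    """Exact Shapley value of each feature for the prediction at x."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        value = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(model, x, set(subset) | {i}, n)
                without_i = coalition_value(model, x, set(subset), n)
                value += weight * (with_i - without_i)
        phi.append(value)
    return phi

# Feature 2 is never used by the toy model, so it should receive no credit.
phi = shapley_values(xor_model, [1, 0, 0])
print(phi)
```

In this toy setting the two interacting features split the credit equally, and the values sum to the gap between the prediction at `x` and the average prediction.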

RESULTS

To address this issue, and to improve the interpretability of Random Forest predictions, we compared different methods for feature importance estimation on real and simulated datasets with non-additive interactions. We detected a discrepancy between the metrics on the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.

CONCLUSIONS

By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
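The permutation feature importance metric described in the abstract can be sketched in a few lines. This is a minimal, standard-library illustration under stated assumptions, not the authors' pipeline: the simulated data, the fixed `xor_model` standing in for a trained Random Forest, and the parameters `n_repeats` and `seed` are all hypothetical.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the outcome.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical simulated data: features 0 and 1 interact non-additively (XOR),
# feature 2 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(500)]
y = [row[0] ^ row[1] for row in X]

def xor_model(row):
    """Stand-in for a fitted model that has learned the interaction."""
    return row[0] ^ row[1]

importances = permutation_importance(xor_model, X, y)
print(importances)
```

Neither interacting feature has a marginal effect on its own, yet permuting either one degrades accuracy, while permuting the noise feature leaves it unchanged. With a real fitted model, the importances would normally be computed on held-out data.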
