
Inferring feature importance with uncertainties with application to large genotype data.

Affiliations

SINTEF DIGITAL, Oslo, Norway.

Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway.

Publication information

PLoS Comput Biol. 2023 Mar 14;19(3):e1010963. doi: 10.1371/journal.pcbi.1010963. eCollection 2023 Mar.

DOI:10.1371/journal.pcbi.1010963
PMID:36917581
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10038287/
Abstract

Estimating feature importance, that is, the contribution of a feature to one or several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including the uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, Sub-SAGE has the advantage that it can be estimated without computationally expensive resampling. We argue that, for all model types, the uncertainty in our Sub-SAGE estimator can be estimated using bootstrapping, and we demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as on large genotype data, where it is used to infer feature importance with respect to obesity.
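The general idea of a Shapley-based global importance score with bootstrap uncertainty can be illustrated with a minimal sketch. This is not the Sub-SAGE estimator from the paper. It is a plain SAGE-style score computed exactly over all feature subsets, with hidden features averaged out over a background sample. The `model` function, the synthetic data, the background size, and all function names are assumptions made for illustration. As a further simplification, the model is held fixed and only the evaluation data are resampled.

```python
import itertools
import math
import numpy as np

def model(X):
    # Hypothetical fitted model: uses features 0 and 1, ignores feature 2.
    return 2.0 * X[:, 0] + 1.0 * X[:, 1]

def restricted_loss(X, y, subset, background):
    # Mean squared error when only the features in `subset` are known.
    # Hidden features are replaced by background rows and averaged out.
    n, d = X.shape
    m = background.shape[0]
    X_rep = np.repeat(X, m, axis=0)        # each row repeated m times
    B_rep = np.tile(background, (n, 1))    # background cycled n times
    hidden = np.array([j not in subset for j in range(d)])
    X_mix = np.where(hidden, B_rep, X_rep)
    preds = model(X_mix).reshape(n, m).mean(axis=1)
    return float(np.mean((y - preds) ** 2))

def shapley_importance(X, y, background):
    # v(S) = loss with no features known minus loss with S known.
    d = X.shape[1]
    base = restricted_loss(X, y, (), background)
    value = {S: base - restricted_loss(X, y, S, background)
             for r in range(d + 1)
             for S in itertools.combinations(range(d), r)}
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            w = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            for S in itertools.combinations(others, r):
                S_with_j = tuple(sorted(S + (j,)))
                phi[j] += w * (value[S_with_j] - value[S])
    return phi

def bootstrap_importance(X, y, n_boot=30, n_background=20, seed=0):
    # Resample rows with replacement and recompute the score each time.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    background = X[rng.choice(n, size=n_background, replace=False)]
    draws = np.array([
        shapley_importance(X[idx], y[idx], background)
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    return draws.mean(axis=0), np.percentile(draws, [2.5, 97.5], axis=0)

# Synthetic data (assumed): three independent features, feature 2 is pure noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(150, 3))
y = model(X) + data_rng.normal(scale=0.5, size=150)

mean_imp, ci = bootstrap_importance(X, y)
for j in range(3):
    print(f"feature {j}: {mean_imp[j]:.3f} [{ci[0, j]:.3f}, {ci[1, j]:.3f}]")
```

Exact enumeration over subsets is only feasible for a handful of features. With genotype-scale data the number of subsets grows exponentially, which is one reason a cheaper estimator is needed for that setting.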

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ee5/10038287/9a61aee8562d/pcbi.1010963.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ee5/10038287/96d401b668e7/pcbi.1010963.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ee5/10038287/c90aa9554ff6/pcbi.1010963.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ee5/10038287/2d7446326dfb/pcbi.1010963.g004.jpg
