Study becomes insight: Ecological learning from machine learning.

Authors

Yu Qiuyan, Ji Wenjie, Prihodko Lara, Ross C Wade, Anchang Julius Y, Hanan Niall P

Affiliations

Plant and Environmental Sciences, New Mexico State University, Las Cruces, New Mexico, USA.

Department of Geography, California State University Long Beach, Long Beach, California, USA.

Publication information

Methods Ecol Evol. 2021 Nov;12(11):2117-2128. doi: 10.1111/2041-210X.13686. Epub 2021 Aug 6.

DOI:10.1111/2041-210X.13686
PMID:35874972
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9292299/
Abstract

The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental 'drivers' is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the 'learning' hidden in the ML models.

We revisit the theoretical background and effectiveness of four approaches for ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI), and two approaches for inference of bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multivariate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables.

We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately affected by sample size. Removing spurious variables improves interpretation of ML models. Increasing sample size has limited value in the presence of spurious variables, but it does improve performance once spurious variables are omitted. Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors on the response variable can be enhanced using surrogate models, including three-dimensional visualizations and use of loess planes to represent independent variable effects and interactions.

Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to 'learn from machine learning'.
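The simulation design described in the abstract (a known functional relationship, added white noise, and a correlated but non-influential variable) can be illustrated with a minimal sketch. This is not the authors' code: the functional form, the noise levels, the sample size, the variable names (`x1`, `x2`, `spurious`), and the choice of scikit-learn's random forest are all assumptions made for illustration. The sketch compares impurity-based importance (scikit-learn's regression analogue of GI) with permutation importance (PI), and computes a partial dependence curve by hand with a hypothetical helper, `partial_dependence_curve`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000  # assumed sample size, not taken from the paper

# Two true drivers (x1, x2) and one spurious variable that is correlated
# with x1 but has no effect on the response. All forms are illustrative.
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
spurious = x1 + rng.normal(0.0, 0.1, n)
y = np.sin(2.0 * np.pi * x1) + 2.0 * x2 + rng.normal(0.0, 0.1, n)  # white noise

names = ["x1", "x2", "spurious"]
X = np.column_stack([x1, x2, spurious])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance: normalized so that the values sum to 1.
impurity_importance = dict(zip(names, model.feature_importances_))

# Permutation importance: mean drop in held-out score when one column is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
permutation_imp = dict(zip(names, perm.importances_mean))


def partial_dependence_curve(fitted_model, data, feature_index, grid_values):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid_values:
        modified = data.copy()
        modified[:, feature_index] = value  # overwrite the feature for every row
        curve.append(fitted_model.predict(modified).mean())
    return np.array(curve)


grid = np.linspace(0.05, 0.95, 10)
pdp_x2 = partial_dependence_curve(model, X_test, 1, grid)  # column 1 is x2

for name in names:
    print(f"{name:9s} impurity={impurity_importance[name]:.3f} "
          f"permutation={permutation_imp[name]:.3f}")
print("PDP for x2:", np.round(pdp_x2, 2))
```

In a setup like this, any importance assigned to `spurious` comes only from its correlation with `x1`, which is the interference the abstract warns about. Refitting the model without the spurious column and comparing the rankings is one way to explore the effect of removing it.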

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/5bf63827b4ac/MEE3-12-2117-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/41900d643e7f/MEE3-12-2117-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/36f8ec1365d7/MEE3-12-2117-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/f4851dbb8db8/MEE3-12-2117-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/9d9a36286e82/MEE3-12-2117-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/afb6a086604b/MEE3-12-2117-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1c3/9292299/0365c6eed8ef/MEE3-12-2117-g001.jpg
