Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets.

Authors

Peterson Ryan A, McGrath Max, Cavanaugh Joseph E

Affiliations

Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA.

Department of Biostatistics, College of Public Health, University of Iowa, 145 N. Riverside Dr., Iowa City, IA 52245, USA.

Publication information

Entropy (Basel). 2024 Aug 31;26(9):746. doi: 10.3390/e26090746.

DOI: 10.3390/e26090746
PMID: 39330080
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11431724/

Abstract

We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the shade of the opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori compared to main effects, and hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package, sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of the performances of random forests to interpretable methods for several case studies, including exemplars in which algorithms performed similarly, and several cases when interpretable methods underperformed. 
This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.
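
The ranked sparsity idea described in the abstract (complex terms must clear a higher bar of evidence than main effects) can be illustrated with a minimal Python sketch. This is not the sparseR API and not necessarily the authors' exact weighting: the weight formula, the `gamma` exponent, the predictor names, and the numeric values are all assumptions chosen for illustration. The sketch gives every candidate term the same raw signal and shows that a lasso-style threshold scaled by an order-dependent weight keeps the main effects while dropping the interactions and squared terms.

```python
import math
from itertools import combinations


def build_terms(features):
    """List candidate model terms as (variables, order) pairs.

    Order 1 = main effect; order 2 = pairwise interaction or squared term.
    """
    terms = [((f,), 1) for f in features]
    terms += [(pair, 2) for pair in combinations(features, 2)]
    terms += [((f, f), 2) for f in features]
    return terms


def penalty_weights(terms, gamma=0.5):
    """Give each term a penalty weight that grows with the size of its order group.

    ASSUMPTION (illustrative only): weight = (terms of this order / number of
    main effects) ** gamma, so main effects get weight 1 and the larger
    order-2 group gets a weight above 1.
    """
    counts = {}
    for _, order in terms:
        counts[order] = counts.get(order, 0) + 1
    n_main = counts[1]
    return [(counts[order] / n_main) ** gamma for _, order in terms]


def soft_threshold(z, threshold):
    """Lasso-style shrinkage: pull z toward zero; return 0 if |z| <= threshold."""
    return math.copysign(max(abs(z) - threshold, 0.0), z)


features = ["age", "dose", "weight"]  # hypothetical predictor names
terms = build_terms(features)
weights = penalty_weights(terms)

lam = 0.4       # hypothetical base penalty
evidence = 0.5  # same hypothetical unpenalized signal for every term

for (variables, order), w in zip(terms, weights):
    estimate = soft_threshold(evidence, lam * w)
    print(" * ".join(variables), "| order", order,
          "| weight", round(w, 3), "| estimate", round(estimate, 3))
```

With three predictors there are three main effects and six order-2 terms, so under the assumed formula the order-2 weight is the square root of 2. The same signal of 0.5 survives the main-effect threshold of 0.4 but not the order-2 threshold of about 0.57, which is the "higher level of evidence" behavior the abstract describes.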

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/95a9ec37775b/entropy-26-00746-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/1c6ff7785279/entropy-26-00746-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/5b5bf99a1750/entropy-26-00746-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/3da37d86370a/entropy-26-00746-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/7128a45e787f/entropy-26-00746-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/9e97ff2c88bf/entropy-26-00746-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21f/11431724/b20b29970777/entropy-26-00746-g007.jpg
