Peterson Ryan A, McGrath Max, Cavanaugh Joseph E
Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, 13001 E. 17th Pl, Aurora, CO 80045, USA.
Department of Biostatistics, College of Public Health, University of Iowa, 145 N. Riverside Dr., Iowa City, IA 52245, USA.
Entropy (Basel). 2024 Aug 31;26(9):746. doi: 10.3390/e26090746.
We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it gives users flexible control over the degree of opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical a priori of higher-order polynomials and interactions than of main effects; hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally, or within 5% of the optimal method, in most real-world data sets. We provide a more in-depth comparison of the performance of random forests to that of interpretable methods for several case studies, including exemplars in which the algorithms performed similarly and several cases in which interpretable methods underperformed. This work provides a strong rationale for including human-centered, transparent algorithms such as ours in predictive modeling applications.
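To make the "out of the box" usage concrete, the sketch below fits a sparsity-ranked lasso with the sparseR package on a built-in R data set. This is a minimal illustration under stated assumptions, not the benchmark code from the study: the argument names (k for the interaction order, poly for the polynomial degree) follow the sparseR documentation on CRAN, but defaults may differ across versions, so consult ?sparseR for the installed release. The intuition is that candidate interaction and polynomial terms enter the model alongside main effects but carry larger penalty weights a priori, so they are retained only when the data supply correspondingly stronger evidence.

## Minimal sketch of an "out of the box" sparseR fit (assumes the
## CRAN sparseR package; verify arguments with ?sparseR).
# install.packages("sparseR")
library(sparseR)

## Sparsity-ranked lasso regression of mpg on all candidate main
## effects, pairwise interactions (k = 1), and quadratic polynomial
## terms (poly = 2), with the more complex terms penalized more
## heavily a priori.
fit <- sparseR(mpg ~ ., data = mtcars, k = 1, poly = 2)

## Inspect which main effects vs. interaction/polynomial terms
## survive at the cross-validation-selected penalty; a transparent
## model lists only a handful of human-readable terms.
summary(fit)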