Tavallali Peyman, Razavi Marianne, Brady Sean
Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, California, United States of America.
Principium Consulting, LLC, Pasadena, California, United States of America.
PLoS One. 2017 Nov 13;12(11):e0187676. doi: 10.1371/journal.pone.0187676. eCollection 2017.
In this article, we propose a new data mining algorithm that both captures non-linearity in the data and finds the best subset model. To produce an enhanced subset of the original variables, a selection method should add a supplementary level of regression analysis that captures complex relationships in the data through mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method presented here can produce an optimal subset of variables, making the overall process of model selection more efficient. By transforming the original inputs, the algorithm yields interpretable parameters and a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This new automatic variable transformation and model selection method offers an optimal and stable model that minimizes mean square error and variability by combining all-possible-subsets selection with variable transformations and interactions. Moreover, the method controls multicollinearity, leading to an optimal set of explanatory variables.
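The combination described above, transforming and interacting the original predictors and then exhaustively searching subsets by least squares, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the particular transformations (squares and pairwise products), and the subset-size cap are illustrative assumptions.

```python
import itertools
import numpy as np

def best_subset_ls(X, y, max_size=3):
    """Exhaustive best-subset least squares over transformed predictors.

    Illustrative sketch only: augments the original predictors with
    squared terms and pairwise interactions, then searches all subsets
    up to `max_size` for the one minimizing mean squared error.
    """
    n, p = X.shape
    # Build the augmented design: originals, squares, pairwise interactions.
    cols = [X[:, j] for j in range(p)]
    names = [f"x{j}" for j in range(p)]
    for j in range(p):
        cols.append(X[:, j] ** 2)
        names.append(f"x{j}^2")
    for j, k in itertools.combinations(range(p), 2):
        cols.append(X[:, j] * X[:, k])
        names.append(f"x{j}*x{k}")
    Z = np.column_stack(cols)

    best_mse, best_names = np.inf, None
    for size in range(1, max_size + 1):
        for subset in itertools.combinations(range(Z.shape[1]), size):
            # Ordinary least squares fit with an intercept on this subset.
            A = np.column_stack([np.ones(n), Z[:, list(subset)]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            mse = np.mean((y - A @ beta) ** 2)
            if mse < best_mse:
                best_mse, best_names = mse, [names[i] for i in subset]
    return best_mse, best_names
```

Because the search is exhaustive over the augmented design, a synergistic effect such as a pure interaction between two predictors can be recovered as a single selected term rather than being approximated by several linear ones.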