Cruz-Ramírez Nicandro, Acosta-Mesa Héctor Gabriel, Mezura-Montes Efrén, Guerra-Hernández Alejandro, Hoyos-Rivera Guillermo de Jesús, Barrientos-Martínez Rocío Erandi, Gutiérrez-Fragoso Karina, Nava-Fernández Luis Alonso, González-Gaspar Patricia, Novoa-del-Toro Elva María, Aguilera-Rueda Vicente Josué, Ameca-Alducin María Yaneli
Facultad de Física e Inteligencia Artificial, Universidad Veracruzana, Xalapa, Veracruz, México.
Centro de Investigaciones Biomédicas, Universidad Veracruzana, Xalapa, Veracruz, México.
PLoS One. 2014 Mar 26;9(3):e92866. doi: 10.1371/journal.pone.0092866. eCollection 2014.
The bias-variance dilemma is a well-known and important problem in Machine Learning. It relates the generalization capability (goodness of fit) of a learning method to its corresponding complexity. When we have enough data at hand, it is possible to use these data in such a way as to minimize overfitting (the risk of selecting a complex model that generalizes poorly). Unfortunately, there are many situations where we simply do not have this required amount of data. Thus, we need methods capable of efficiently exploiting the available data while avoiding overfitting. Different metrics have been proposed to achieve this goal: the Minimum Description Length principle (MDL), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), among others. In this paper, we focus on crude MDL and empirically evaluate its performance in selecting models with a good balance between goodness of fit and complexity: the so-called bias-variance dilemma, decomposition, or tradeoff. Although the graphical interaction between these dimensions (bias and variance) is ubiquitous in the Machine Learning literature, few works present experimental evidence that recovers this interaction. We argue that the graphs resulting from our experiments allow us to gain insights that are difficult to unveil otherwise: that crude MDL naturally selects balanced models in terms of bias-variance, which need not necessarily be the gold-standard ones. We carry out these experiments using a specific model: a Bayesian network. In spite of these motivating results, we should also not overlook three other components that may significantly affect the final model selection: the search procedure, the noise rate and the sample size.
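For readers unfamiliar with the metric, the crude (two-part) MDL score discussed in the abstract is commonly formulated as MDL = -log P(D | G, θ̂) + (k/2) log n, where the first term measures goodness of fit and the second penalizes complexity via the number of free parameters k and the sample size n. The sketch below is a minimal, illustrative implementation of this score for a discrete Bayesian network; the function and variable names are ours, not the paper's code, and the toy network is an assumption for demonstration only.

```python
# Minimal sketch of the crude (two-part) MDL score for a discrete
# Bayesian network, assuming the common formulation
#   MDL = -log2 P(D | G, theta_hat) + (k / 2) * log2(n),
# where k is the number of free parameters and n the sample size.
# The toy network and variable names are illustrative, not the
# experimental setup of the paper.
from collections import Counter
from math import log2

def crude_mdl(data, parents, cardinality):
    """data: list of dicts {var: value}; parents: {var: [parent vars]};
    cardinality: {var: number of states}. Returns the crude MDL score."""
    n = len(data)
    log_lik = 0.0
    k = 0  # total number of free parameters in the network
    for var, pa in parents.items():
        # Joint counts over (parent configuration, child value) and
        # marginal counts over parent configurations.
        joint = Counter(tuple(row[p] for p in pa) + (row[var],) for row in data)
        pa_counts = Counter(tuple(row[p] for p in pa) for row in data)
        # Log-likelihood under maximum-likelihood (relative-frequency) parameters.
        for cfg, cnt in joint.items():
            log_lik += cnt * log2(cnt / pa_counts[cfg[:-1]])
        # Each parent configuration contributes (r - 1) free parameters,
        # where r is the child's number of states.
        q = 1
        for p in pa:
            q *= cardinality[p]
        k += q * (cardinality[var] - 1)
    return -log_lik + (k / 2) * log2(n)

# Toy usage: a two-node network X -> Y with binary variables.
data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 1}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
parents = {"X": [], "Y": ["X"]}
cardinality = {"X": 2, "Y": 2}
print(crude_mdl(data, parents, cardinality))
```

In a structure search, a score like this would be computed for each candidate network and the minimizer selected; the abstract's point is that the structures so selected balance bias and variance but are not guaranteed to coincide with the gold-standard network.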