Hanczar Blaise, Hua Jianping, Dougherty Edward R
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
EURASIP J Bioinform Syst Biol. 2007;2007(1):38473. doi: 10.1155/2007/38473.
The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; the other two are needed for comparison. We observe that the true and estimated errors tend to be much more correlated with a known feature set than with either feature selection or all features. Which of the latter two yields the better correlation shows no general trend and differs from model to model.
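The role of correlation in the deviation variance can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: the Gaussian model, the nearest-mean classifier, the sample sizes, and all function names are assumptions chosen for illustration, and the true error is only approximated on a large independent test set. It checks the standard identity Var(estimated - true) = Var(estimated) + Var(true) - 2 * rho * sd(estimated) * sd(true), which is presumably the kind of decomposition the abstract refers to, using leave-one-out cross-validation as the estimator.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_test):
    # Assumed classification rule: assign each point to the closer class mean.
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - m0) ** 2).sum(axis=1)
    d1 = ((X_test - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def sample(rng, n_per_class, dim, shift):
    # Assumed synthetic model: unit-variance Gaussians; only the first
    # feature separates the classes, the rest are noise.
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    X1 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    X1[:, 0] += shift
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def loo_error(X, y):
    # Leave-one-out cross-validation estimate of the error.
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        pred = nearest_mean_predict(X[keep], y[keep], X[i:i + 1])
        wrong += int(pred[0] != y[i])
    return wrong / n

def simulate(n_rep=200, n_per_class=15, dim=5, shift=1.5, seed=0):
    rng = np.random.default_rng(seed)
    # A large test set stands in for the true error (an approximation).
    X_test, y_test = sample(rng, 2000, dim, shift)
    est_err = np.empty(n_rep)
    true_err = np.empty(n_rep)
    for r in range(n_rep):
        X, y = sample(rng, n_per_class, dim, shift)
        est_err[r] = loo_error(X, y)
        true_err[r] = np.mean(nearest_mean_predict(X, y, X_test) != y_test)
    return est_err, true_err

def decompose(est_err, true_err):
    # Population (ddof=0) moments, so the identity holds exactly.
    var_est = est_err.var()
    var_true = true_err.var()
    cov = np.mean((est_err - est_err.mean()) * (true_err - true_err.mean()))
    rho = cov / np.sqrt(var_est * var_true)
    var_dev = (est_err - true_err).var()
    return var_dev, var_est, var_true, rho

if __name__ == "__main__":
    est_err, true_err = simulate()
    var_dev, var_est, var_true, rho = decompose(est_err, true_err)
    rebuilt = var_est + var_true - 2 * rho * np.sqrt(var_est * var_true)
    print("correlation:", rho)
    print("Var(deviation):", var_dev, "rebuilt from parts:", rebuilt)
```

Increasing `dim` while holding `n_per_class` fixed is one way to explore how the correlation behaves as dimensionality grows in this toy model. Any trend seen here is specific to these assumptions and is not a reproduction of the paper's results.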