Wang Meng, Brunekreef Bert, Gehring Ulrike, Szpiro Adam, Hoek Gerard, Beelen Rob
From the (a) Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands; (b) Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA; (c) Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands; and (d) Department of Biostatistics, University of Washington, Seattle, WA.
Epidemiology. 2016 Jan;27(1):51-6. doi: 10.1097/EDE.0000000000000404.
Leave-one-out cross-validation that fails to account for variable selection does not properly reflect prediction accuracy when the number of training sites is small. The impact on health effect estimates has rarely been studied. The objective of this study was to develop an improved validation procedure for land-use regression models with variable selection and investigate health effect estimates in relation to land-use regression model performance.
We randomly generated 10 training and test sets for nitrogen dioxide and particulate matter. For each training set, we developed models and evaluated them using a cross-holdout validation approach. Unlike standard leave-one-out cross-validation, which refits the already selected model without repeating variable selection, cross-holdout validation develops a new model, including variable selection, for each evaluation. We also implemented holdout validation, which evaluates model predictions using independent test sets. We evaluated the relationship between the cross-holdout validation and holdout validation R² values and estimates of the association between air pollution and forced vital capacity in the Dutch birth cohort.
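The contrast between the two validation schemes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the data are synthetic, the greedy forward selection with a 0.01 R² stopping rule is a hypothetical stand-in for the land-use regression variable selection, the held-out unit is a single site (the abstract does not specify how sites are split), and all function names are invented for the example.

```python
import numpy as np

def design(X, cols):
    """Design matrix with an intercept and the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0]), X[:, np.asarray(cols, dtype=int)]])

def fit_ols(X, cols, y):
    """Ordinary least-squares coefficients for y ~ 1 + X[:, cols]."""
    beta, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return beta

def r_squared(y, y_hat):
    """1 - SSE/SST, computed against the mean of the observed values."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def forward_select(X, y, max_vars=3, min_gain=0.01):
    """Hypothetical selection rule: greedily add the predictor that most
    raises in-sample R^2, stopping when the gain falls below min_gain."""
    chosen, best = [], 0.0
    for _ in range(max_vars):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [r_squared(y, design(X, chosen + [j]) @ fit_ols(X, chosen + [j], y))
                  for j in candidates]
        k = int(np.argmax(scores))
        if scores[k] - best < min_gain:
            break
        chosen.append(candidates[k])
        best = scores[k]
    return chosen

def loocv_fixed_variables(X, y):
    """Standard LOOCV: variables are selected once using all sites, then
    only the coefficients are refitted with each site left out."""
    cols = forward_select(X, y)
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta = fit_ols(X[keep], cols, y[keep])
        preds[i] = (design(X[i:i + 1], cols) @ beta)[0]
    return r_squared(y, preds)

def validation_with_reselection(X, y):
    """Cross-holdout idea: a new model, including variable selection,
    is developed every time a site is held out."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        cols = forward_select(X[keep], y[keep])
        beta = fit_ols(X[keep], cols, y[keep])
        preds[i] = (design(X[i:i + 1], cols) @ beta)[0]
    return r_squared(y, preds)

# Synthetic example: 24 sites, 30 candidate predictors, one real signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 30))
y = 1.5 * X[:, 0] + rng.normal(size=24)

r2_fixed = loocv_fixed_variables(X, y)
r2_reselect = validation_with_reselection(X, y)
print(f"LOOCV R^2 (selection outside the loop): {r2_fixed:.2f}")
print(f"R^2 with selection repeated per fold:   {r2_reselect:.2f}")
```

The only difference between the two functions is where `forward_select` is called. When it runs outside the loop, the held-out site has already influenced which predictors were chosen, which is the mechanism the abstract identifies as making leave-one-out cross-validation unreliable with few training sites.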
Cross-holdout validation R² values were generally identical to holdout validation R² values, but were notably smaller than leave-one-out cross-validation R² values. Decreases in forced vital capacity in relation to air pollution exposure were larger for land-use regression models with larger holdout validation and cross-holdout validation R² values; no such pattern was seen for leave-one-out cross-validation R².
Cross-holdout validation accurately reflects the predictive ability of land-use regression models and is a useful validation approach for small datasets. In a case study, the magnitude of health effect estimates was associated with land-use regression predictive ability as measured by holdout validation and cross-holdout validation, but not as measured by leave-one-out cross-validation.