Calibration and validation of multiple regression models for stormwater quality prediction: data partitioning, effect of dataset size and characteristics.

Author information

Mourad M, Bertrand-Krajewski J L, Chebbo G

Affiliation

Laboratoire URGC Hydrologie Urbaine, INSA de Lyon, 34 avenue des Arts, 69621 Villeurbanne, France.

Publication information

Water Sci Technol. 2005;52(3):45-52.

PMID:16206843

Abstract

Two main issues regarding stormwater quality models have been investigated: i) the effect of calibration dataset size and characteristics on calibration and validation results; ii) the optimal split of available data into calibration and validation subsets. Data from 13 catchments have been used for three pollutants: biochemical oxygen demand (BOD), chemical oxygen demand (COD) and suspended solids (SS). Three multiple regression models were calibrated and validated. Using different data sets and different models makes it possible to identify general trends. The main finding is that multiple regression models are sensitive to the calibration data. Calibrating with few data leads to poor predictions despite good calibration results. It was also found that randomly splitting the available data into halves for calibration and validation is not optimal: more data should be allocated to calibration. The proportion of data to be used for validation increases with the number of available data (N) and reaches about 35% for N around 55 measured events.
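
The calibration/validation split discussed in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the three explanatory variables, the synthetic data, the ordinary least-squares fit, the RMSE criterion and the function name `split_calibrate_validate` are all assumptions made for illustration only. The sketch repeats a random split many times for several validation fractions and reports the mean calibration and validation errors.

```python
import numpy as np


def split_calibrate_validate(X, y, validation_fraction, rng):
    """Randomly split events, fit ordinary least squares on the calibration
    subset, and return (calibration RMSE, validation RMSE).

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(y)
    n_val = int(round(validation_fraction * n))
    order = rng.permutation(n)
    val_idx, cal_idx = order[:n_val], order[n_val:]

    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A[cal_idx], y[cal_idx], rcond=None)

    def rmse(idx):
        residuals = y[idx] - A[idx] @ coef
        return float(np.sqrt(np.mean(residuals ** 2)))

    return rmse(cal_idx), rmse(val_idx)


# Synthetic stand-in data (assumed): 55 events, 3 explanatory variables.
rng = np.random.default_rng(0)
n_events = 55
X = rng.uniform(0.0, 1.0, size=(n_events, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(0.0, 0.2, n_events)

results = {}
for fraction in (0.2, 0.35, 0.5):
    # Repeat the random split to average out the effect of one particular draw.
    scores = [split_calibrate_validate(X, y, fraction, rng) for _ in range(200)]
    cal_rmse, val_rmse = np.mean(scores, axis=0)
    results[fraction] = (float(cal_rmse), float(val_rmse))
    print(f"validation fraction {fraction:.2f}: "
          f"calibration RMSE {cal_rmse:.3f}, validation RMSE {val_rmse:.3f}")
```

Comparing the two errors across fractions is one simple way to see how the amount of data left for calibration affects predictive performance; with real measurements, the choice of model form and error criterion would need to follow the study being reproduced.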
