Wheeler David C, Nolan Bernard T, Flory Abigail R, DellaValle Curt T, Ward Mary H
Department of Biostatistics, Virginia Commonwealth University, 830 East Main St, Richmond, VA 23298, United States.
U.S. Geological Survey, Reston, VA, United States.
Sci Total Environ. 2015 Dec 1;536:481-488. doi: 10.1016/j.scitotenv.2015.07.080. Epub 2015 Jul 30.
Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r² = 0.77) and was acceptable in the testing set (r² = 0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort.
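The abstract summarizes agreement between observed and estimated nitrate concentrations with an r-square value for the training set and the testing set. A minimal sketch of one common way to compute such a statistic is given here, assuming it is the squared Pearson correlation between observed and predicted values. The abstract does not state the exact formula the authors used. The function name `r_squared` and the sample values are hypothetical and are not taken from the study.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences.

    Assumption: "r-square" means the squared Pearson correlation;
    the abstract does not spell out the formula.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (cov * cov) / (var_o * var_p)


# Hypothetical nitrate concentrations, log-transformed before comparison
# because the model in the abstract predicts log nitrate levels.
observed_log = [math.log(x) for x in [1.0, 2.5, 5.0, 10.0, 20.0]]
predicted_log = [math.log(x) for x in [1.5, 2.0, 6.0, 8.0, 15.0]]

score = r_squared(observed_log, predicted_log)
print(f"r-square on the log scale: {score:.2f}")
```

Computing the statistic separately on held-out measurements, as the abstract does, shows how much of the training-set agreement carries over to wells the model has not seen.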