The University of Texas at Austin, Civil, Architectural and Environmental Engineering, 301 E. Dean Keeton St., Austin, TX 78712, United States; Oregon State University, Environmental and Molecular Toxicology, 1007 Agriculture and Life Sciences Building, Corvallis, OR 97331, United States.
Virginia Commonwealth University, Department of Biostatistics, 830 East Main St., Richmond, VA 23298, United States.
Sci Total Environ. 2019 Mar 10;655:512-519. doi: 10.1016/j.scitotenv.2018.11.022. Epub 2018 Nov 5.
Unregulated private wells in the United States are susceptible to many groundwater contaminants. Ingestion of nitrate, the most common anthropogenic private well contaminant in the United States, can lead to the endogenous formation of N-nitroso compounds, which are known human carcinogens. In this study, we expand upon previous efforts to model private well groundwater nitrate concentration in North Carolina by developing multiple machine learning models and evaluating them on out-of-sample predictions. Our purpose was to develop exposure estimates in unmonitored areas for use in the Agricultural Health Study (AHS) cohort. Using approximately 22,000 private well nitrate measurements in North Carolina, we trained and tested continuous models including a censored maximum likelihood-based linear model, random forest, gradient boosted machine, support vector machine, neural networks, and kriging. Continuous nitrate models had low predictive performance (R² < 0.33), so multiple random forest classification models were also trained and tested. The final classification approach predicted <1 mg/L, 1-5 mg/L, and ≥5 mg/L using a random forest model with 58 variables that maximized the Cohen's kappa statistic. The final model had an overall accuracy of 0.75, with high specificity for the two higher categories and high sensitivity for the lowest category. The results will be used for the categorical prediction of private well nitrate for AHS cohort participants who reside in North Carolina.
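As a rough illustration of the evaluation described in the abstract, the sketch below bins nitrate concentrations into the three reported categories and computes Cohen's kappa between observed and predicted classes. It is not the authors' code: the function names, the handling of values exactly at 1 mg/L and 5 mg/L, and the sample values are all assumptions made for illustration.

```python
import math
from collections import Counter


def nitrate_class(mg_per_l: float) -> int:
    """Map a nitrate concentration (mg/L) to one of three classes.

    Cut points follow the abstract: <1, 1-5, and >=5 mg/L.
    Assigning exactly 1 mg/L to the middle class is an assumption.
    """
    if mg_per_l < 1.0:
        return 0
    if mg_per_l < 5.0:
        return 1
    return 2


def cohens_kappa(observed, predicted) -> float:
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of samples where the labels match.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement: product of the marginal class frequencies.
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[k] * pred_counts[k] for k in obs_counts) / (n * n)
    if math.isclose(p_e, 1.0):
        return float("nan")  # undefined when both labelings use a single class
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical measurements (mg/L) and hypothetical model predictions.
measured = [0.2, 0.8, 3.0, 6.5, 0.5, 1.2]
predicted_classes = [0, 0, 1, 1, 0, 1]

observed_classes = [nitrate_class(x) for x in measured]
kappa = cohens_kappa(observed_classes, predicted_classes)
print(f"kappa = {kappa:.3f}")
```

Kappa is a more demanding criterion than raw accuracy when the classes are imbalanced, because it discounts the agreement a model would reach by chance from the class frequencies alone.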