Shuryak Igor
Center for Radiological Research, Columbia University, New York, New York, United States of America.
PLoS One. 2017 Jan 9;12(1):e0170007. doi: 10.1371/journal.pone.0170007. eCollection 2017.
The ecological effects of accidental or malicious radioactive contamination are insufficiently understood because of the hazards and difficulties associated with conducting studies in radioactively polluted areas. Data sets from severely contaminated locations can therefore be small. Moreover, many potentially important factors, such as soil concentrations of toxic chemicals, pH, and temperature, can be correlated with radiation levels and with each other. In such situations, commonly used statistical techniques like generalized linear models (GLMs) may not be able to provide useful information about how radiation and/or these other variables affect the outcome (e.g., abundance of the studied organisms). Ensemble machine learning methods such as random forests offer powerful alternatives. We propose that analysis of small radioecological data sets by GLMs and/or machine learning can be made more informative by using the following techniques: (1) adding synthetic noise variables to provide benchmarks for distinguishing the performances of valuable predictors from irrelevant ones; (2) adding noise directly to the predictors and/or to the outcome to test the robustness of analysis results against random data fluctuations; (3) adding artificial effects to selected predictors to test the sensitivity of the analysis methods in detecting predictor effects; (4) running a selected machine learning method multiple times (with different random-number seeds) to test the robustness of the detected "signal"; (5) using several machine learning methods to test the "signal's" sensitivity to differences in analysis techniques. Here, we applied these approaches to simulated data, and to two published examples of small radioecological data sets: (I) counts of fungal taxa in samples of soil contaminated by the Chernobyl nuclear power plant accident (Ukraine), and (II) bacterial abundance in soil samples under a ruptured nuclear waste storage tank (USA).
We show that the proposed techniques were advantageous compared with the methodology used in the original publications where the data sets were presented. Specifically, our approach identified a negative effect of radioactive contamination in data set I, and suggested that in data set II stable chromium could have been a stronger limiting factor for bacterial abundance than the radionuclides ¹³⁷Cs and ⁹⁹Tc. This new information, which was extracted from these data sets using the proposed techniques, can potentially enhance the design of radioactive waste bioremediation.
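As a rough illustration of techniques (1) and (3), the sketch below scores predictors against a benchmark built from synthetic noise variables, repeats the benchmark with several random-number seeds in the spirit of technique (4), and then adds an artificial effect to a predictor that has no real one. It is a minimal sketch under stated assumptions, not the analysis used in the paper: the simulated data, every function name, and the use of absolute Pearson correlation as the importance score are hypothetical stand-ins for a GLM or random forest importance measure.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def importance(x, y):
    """Stand-in importance score (assumption): absolute correlation with y."""
    return abs(pearson(x, y))


def simulate(n=40, seed=1):
    """Hypothetical small data set: radiation lowers abundance,
    temperature has no effect on it."""
    rng = random.Random(seed)
    radiation = [rng.gauss(0, 1) for _ in range(n)]
    temperature = [rng.gauss(0, 1) for _ in range(n)]
    abundance = [-1.5 * r + rng.gauss(0, 1) for r in radiation]
    return {"radiation": radiation, "temperature": temperature}, abundance


def noise_benchmark(y, n_noise=200, quantile=0.95, seed=0):
    """Technique (1): score many synthetic noise variables against y and
    return the chosen quantile of their scores as a benchmark."""
    rng = random.Random(seed)
    scores = sorted(
        importance([rng.gauss(0, 1) for _ in y], y) for _ in range(n_noise)
    )
    return scores[int(quantile * (n_noise - 1))]


def add_artificial_effect(x, y, slope):
    """Technique (3): return a copy of y with a known effect of x added."""
    return [b + slope * a for a, b in zip(x, y)]


def screen(predictors, y, seeds=range(5)):
    """Compare each predictor's score with noise benchmarks drawn under
    several random-number seeds (in the spirit of technique (4))."""
    strictest = max(noise_benchmark(y, seed=s) for s in seeds)
    return {
        name: {
            "score": importance(x, y),
            "beats_all_benchmarks": importance(x, y) > strictest,
        }
        for name, x in predictors.items()
    }


if __name__ == "__main__":
    predictors, abundance = simulate()
    print(screen(predictors, abundance))
    # Sensitivity check: give temperature a known artificial effect.
    spiked = add_artificial_effect(predictors["temperature"], abundance, slope=3.0)
    print(screen(predictors, spiked))
```

A predictor is treated as informative here only if its score exceeds the strictest noise benchmark across seeds. The spiked run checks whether the procedure can detect an effect of known size. Swapping `importance` for a model-based measure would bring the sketch closer to the GLM and random forest setting described above.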