Saleh Omar, Otim Francesca Nyega, Otim Ochan
Department of Humanities and Sciences, University of California - Los Angeles, Los Angeles, CA 90024, USA.
Department of Anthropology, University of California, Davis, 1 Shields Ave, Davis, CA 95616, USA.
Sci Total Environ. 2023 Nov 25;901:165946. doi: 10.1016/j.scitotenv.2023.165946. Epub 2023 Aug 2.
Benthic sediment toxicity is linked to harmful effects in marine organisms and humans, and understanding that link requires, in part, a comprehensive analysis of the sediment toxicity data already in hand. One tool that could aid the process is machine learning (ML); supervised classification modeling in particular has transformed how actionable insights are acquired from large datasets. The current study is a proof of concept that seeks an ML classifier able to accurately extrapolate the characteristics of a California-wide coastal training dataset of 5437 data points (assembled from 1635 samples) to predict sediment toxicity in the Southern California Bight (SCB). Twelve classifiers were trained to recognize sediment toxicity using 70 % of the dataset; among them, a Gradient Boosting Classifier (GBC) model using latitude, longitude, and water depth was the most accurate at predicting toxicity (83 %). Among these variables, latitude was the most significant driver of GBC predictions in this test ecosystem. The model's performance was verified with the remaining 30 % of the dataset and found to be 83 % accurate. Presented with 884 unfamiliar data points assembled from 854 measurements at 346 stations across the SCB, the trained GBC was 87 % accurate, demonstrating a role that supervised learning can play in southern California environmental analytics.
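The workflow described in the abstract (a 70/30 split, a Gradient Boosting Classifier on latitude, longitude, and water depth, then an accuracy check on held-out data) can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, which the abstract does not name, and it runs on synthetic placeholder data: the function names, coordinate ranges, labelling rule, and default hyperparameters are all hypothetical and are not the study's dataset or configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_sediment_data(n_samples=1000, seed=0):
    """Generate placeholder (latitude, longitude, depth) rows with a toy label.

    The labelling rule is invented for illustration only; it is NOT the
    relationship reported in the study.
    """
    rng = np.random.default_rng(seed)
    latitude = rng.uniform(32.5, 42.0, n_samples)       # degrees north
    longitude = rng.uniform(-124.5, -117.0, n_samples)  # degrees east
    depth_m = rng.uniform(1.0, 500.0, n_samples)        # metres
    X = np.column_stack([latitude, longitude, depth_m])
    # Toy rule with 10 % label noise: shallow, southern samples are "toxic".
    toxic = (latitude < 36.0) & (depth_m < 200.0)
    flip = rng.random(n_samples) < 0.10
    y = np.where(flip, ~toxic, toxic).astype(int)
    return X, y


def train_and_evaluate(X, y, seed=0):
    """Fit a GBC on 70 % of the rows and score it on the held-out 30 %."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    # Library defaults are used here; the study's settings are not given.
    model = GradientBoostingClassifier(random_state=seed)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


if __name__ == "__main__":
    X, y = make_synthetic_sediment_data()
    model, accuracy = train_and_evaluate(X, y)
    print(f"held-out accuracy: {accuracy:.2f}")
    # Relative contribution of each feature to the fitted model.
    for name, importance in zip(
        ["latitude", "longitude", "depth_m"], model.feature_importances_
    ):
        print(f"{name}: {importance:.3f}")
```

The `feature_importances_` attribute is one common way to ask which input drives a fitted GBC; whether the study ranked latitude with this measure or another is not stated in the abstract.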