

Influence of sample size and machine learning algorithms on digital soil nutrient mapping accuracy.

Authors

Dash Prava Kiran, Ferhatoglu Caner, Miller Bradley A, Panigrahi Niranjan, Mishra Antaryami

Affiliations

Department of Soil Science and Agricultural Chemistry, College of Agriculture, Odisha University of Agriculture and Technology, Bhubaneswar, Odisha, 751003, India.

Regional Research and Technology Transfer Station, Mahisapat, Dhenkanal, Odisha, 759013, India.

Publication information

Environ Monit Assess. 2025 Aug 8;197(9):996. doi: 10.1007/s10661-025-14322-w.

Abstract

The objective of this study is to evaluate and compare the performance of different machine learning (ML) algorithms, viz., multi-layer perceptron (MLP), random forest (RF), extra trees regressor (ETR), CatBoost, and gradient boost (GB), considering the impact of variation in sample size on the prediction of soil nutrients. In this context, this study evaluates the impact of sample size on the prediction performance of five ML algorithms for mapping 14 soil properties, including key soil physico-chemical properties (soil organic carbon, pH, and electrical conductivity) and multiple macronutrients (available nitrogen, phosphorus, potassium, calcium, magnesium, and sulfur) and micronutrients (available iron, manganese, copper, zinc, and boron), over a geographical extent of 8303 km² in eastern India. A total of 1024 surface soil samples were collected, of which 800 were used for model training, while the remaining 224 were reserved for independent validation (IV) of the resultant maps. The original training data set (800 samples) was reduced by random selection into six different sample sizes, i.e., 800, 400, 200, 100, 50, and 25. An exhaustive set of 574 environmental variables, derived from digital terrain derivatives and Sentinel-2 satellite imagery, was used as predictors. Established statistical indicators, such as Lin's concordance correlation coefficient (CCC) and root mean squared error (RMSE), were employed to evaluate the predictive capability of the algorithms under the different sample size scenarios. The results showed that prediction accuracy and the reliability of prediction performance across multiple target variables improved with sample size. However, beyond a certain point, the improvement in predictive performance became negligible relative to the effort of additional sampling.
All the ML algorithms responded well to increases in sample size (mean IV-CCC ranged from 0.26 to 0.64 across soil properties), except MLP, which exhibited poorer prediction performance (mean IV-CCC ranged from 0.14 to 0.29 across soil properties). Micronutrients in general responded well to the increase in sample size. The uncertainty analysis revealed that increasing sample size generally reduced prediction uncertainty, though the extent varied by soil property and ML algorithm. In summary, the results presented herein demonstrate an effective way to select an appropriate sample size and a suitable ML algorithm for accurately predicting multiple soil nutrients, which can guide projects toward optimal accuracy.
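The evaluation workflow described above — subsampling the training pool at six sizes, fitting a model at each size, and scoring predictions on the held-out independent validation set with Lin's CCC and RMSE — can be sketched as follows. This is a minimal illustration using synthetic data and a single RandomForest regressor; the covariate count, data-generating process, and model settings are assumptions for demonstration, not the study's actual data or configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def lins_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient:
    2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

# Synthetic stand-in: 1024 "soil samples" with 20 environmental covariates
rng = np.random.default_rng(42)
n, p = 1024, 20
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

# Mirror the paper's split: a training pool and 224 samples held out
# for independent validation (IV) of the fitted models
X_pool, X_iv, y_pool, y_iv = train_test_split(
    X, y, test_size=224, random_state=0
)

# Randomly subsample the training pool at the six sample sizes and
# score each fitted model on the same IV set
for size in (25, 50, 100, 200, 400, 800):
    idx = rng.choice(len(X_pool), size=size, replace=False)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_pool[idx], y_pool[idx])
    pred = model.predict(X_iv)
    rmse = np.sqrt(((pred - y_iv) ** 2).mean())
    print(f"n={size:4d}  CCC={lins_ccc(y_iv, pred):.3f}  RMSE={rmse:.3f}")
```

On data like this, CCC typically rises and RMSE falls as the training subsample grows, with diminishing gains at the larger sizes — the qualitative pattern the abstract reports.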

