BACKGROUND

Visible-near infrared (vis-NIR) spectroscopy has attracted considerable interest in recent years as a rapid method of soil characterisation. Soil absorbance spectra in the vis-NIR range can be calibrated with regression models to predict a range of soil properties. The accuracy of these regression models relies heavily on the calibration set: an optimum sample size and a representative selection of samples could further improve model performance. However, there is no guideline on which sampling method should be used for datasets of different sizes.

METHODS

Here, we show that sampling algorithms perform differently depending on dataset size and regression model (Cubist regression tree and partial least squares regression (PLSR)). We compared three sampling algorithms, Kennard-Stone (KS), conditioned Latin hypercube sampling (cLHS) and k-means clustering (KM), against random sampling for the prediction of up to five soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets differ in coverage: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). Calibration sample sizes ranging from 50 to 3,000 were derived and tested for the continental dataset, and from 50 to 200 for the regional and local datasets.
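
As an illustration of one of the sampling algorithms named above, the following is a minimal sketch of the classic Kennard-Stone selection rule: start with the two most distant samples, then repeatedly add the sample that is farthest from everything already selected. It is not the authors' implementation. The function name `kennard_stone`, the use of Euclidean distance, and the toy two-dimensional points are assumptions made for this example; the abstract does not say which feature space (raw spectra, principal component scores, or something else) the selection was run on.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule.

    points   : list of equal-length numeric tuples (hypothetical feature vectors)
    n_select : number of calibration samples to pick (2 <= n_select <= len(points))
    """
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")

    # Step 1: start with the two points that are farthest apart.
    best = (-1.0, 0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d > best[0]:
                best = (d, i, j)
    selected = [best[1], best[2]]

    # min_dist[k] holds the distance from point k to its nearest selected point.
    min_dist = [
        min(math.dist(p, points[best[1]]), math.dist(p, points[best[2]]))
        for p in points
    ]

    # Step 2: repeatedly add the unselected point farthest from the selected set.
    while len(selected) < n_select:
        candidates = [k for k in range(n) if k not in selected]
        nxt = max(candidates, key=lambda k: min_dist[k])
        selected.append(nxt)
        for k in range(n):
            min_dist[k] = min(min_dist[k], math.dist(points[k], points[nxt]))

    return selected


# Toy data: five samples lying on a line (illustrative only, not soil spectra).
toy = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (4.9, 0.0)]
print(kennard_stone(toy, 3))  # [0, 2, 3]
```

In the toy run, the two extremes are picked first and the third pick is the sample midway between them, which shows why the method tends to cover the edges of the feature space before filling in its interior. The pairwise search in step 1 is quadratic in the number of samples, so a dataset of several thousand spectra would likely call for a vectorised implementation.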

RESULTS

Overall, PLSR predicted the various soil properties better than the Cubist model, and it was also less sensitive to the choice of sampling algorithm. The KM algorithm gave more representative calibration sets in the larger dataset, up to a certain calibration sample size. The KS algorithm appeared more efficient than random sampling in small datasets; however, its prediction performance varied considerably between soil properties. The cLHS algorithm was the most robust sampling method across multiple soil properties, regardless of sample size.

DISCUSSION

Our results suggest that the optimum calibration sample size depends on how much generalization the model has to achieve. Using a sampling algorithm is more beneficial for larger datasets than for smaller ones, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but its results can be variable, and cLHS is less affected by sample size.
