A population spatialization method based on the integration of feature selection and an improved random forest model.

Author information

Zhao Zhen, Guo Hongmei, Jiang Xueli, Zhang Ying, Lu Changjiang, Zhang Can, He Zonghang

Affiliations

The Seismological Bureau of Sichuan Province, Chengdu, Sichuan, China.

Southwest Jiaotong University, Chengdu, Sichuan, China.

Publication information

PLoS One. 2025 Apr 3;20(4):e0321263. doi: 10.1371/journal.pone.0321263. eCollection 2025.

DOI:10.1371/journal.pone.0321263

PMID:40179342

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11968112/

Abstract

Accurate knowledge of the spatial distribution of population is essential for effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population distribution characteristics and the limitations of the RF model in handling imbalanced datasets reduce prediction accuracy. To address these issues, this study proposes a population spatialization model that integrates feature selection with an improved random forest. First, three feature selection methods, recursive feature elimination with cross-validation (RFECV), the maximal information coefficient (MIC), and mean decrease in accuracy (MDA), were used to select population distribution feature factors. A random forest was built on the feature subset chosen by each method, giving three models: MIC-RF, RFECV-RF, and MDA-RF. The feature factors behind the most accurate model were then taken as the optimal feature subset and used as input data for model construction. Next, to account for the imbalance in the spatial distribution of population, the K-means++ algorithm was used to cluster the optimal feature subset, and bootstrap sampling was used to draw the same number of samples from each cluster; these samples were fused with the training subset to build the improved random forest model. Based on this model, a spatial population distribution dataset of the Southern Sichuan Economic Zone was generated at 500 m resolution. Finally, this dataset was compared with and validated against the WorldPop dataset. The results showed that, compared with an RF trained on all factors, the feature selection methods improved model accuracy to varying degrees; among them, MDA-RF had the lowest mean absolute percentage error (MAPE), 0.174, and the highest coefficient of determination (R2), 0.913. The feature factors selected by the MDA method were therefore taken as the optimal feature subset. Compared with MDA-RF, the improved RF built on the same subset increased prediction accuracy by 1.7%, indicating that modifying the random forest's bootstrap sampling with K-means++ clustering can enhance model accuracy to some extent. The proposed method was also more accurate than the WorldPop dataset: the mean relative error (MRE) and root mean square error (RMSE) of the WorldPop dataset were 57.24 and 23174.98, respectively, whereas those of the proposed method were 25.00 and 15776.50. This suggests that the proposed method can simulate the spatial distribution of population more accurately.
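
The cluster-balanced bootstrap described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`kmeans_pp`, `balanced_bootstrap`, `tree_training_indices`), the toy two-dimensional data, the choice of two clusters, and the 20 draws per cluster are all assumptions made for illustration. The way the balanced draws are "fused" with an ordinary bootstrap sample (plain concatenation of row indices) is one plausible reading of the abstract, not a confirmed detail. Only the Python standard library is used; in practice, library implementations of K-means++ and random forest would likely replace the hand-written pieces.

```python
import random
from collections import defaultdict


def kmeans_pp(points, k, rng, n_iter=20):
    """Tiny K-means with K-means++ seeding; returns one cluster label per point.

    Assumes `points` contains at least k distinct tuples.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # K-means++ seeding: each new centre is drawn with probability
    # proportional to its squared distance from the nearest chosen centre.
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(d2(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centre for every point.
        labels = [min(range(k), key=lambda j: d2(p, centres[j])) for p in points]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties out
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def balanced_bootstrap(labels, n_per_cluster, rng):
    """Draw the same number of row indices, with replacement, from every cluster."""
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    sample = []
    for lab in sorted(by_cluster):
        sample.extend(rng.choices(by_cluster[lab], k=n_per_cluster))
    return sample


def tree_training_indices(labels, n_per_cluster, rng):
    """Row indices for one tree: ordinary bootstrap plus balanced draws.

    ASSUMPTION: "fusing" is modelled here as simple concatenation.
    """
    n = len(labels)
    ordinary = rng.choices(range(n), k=n)  # standard RF bootstrap sample
    return ordinary + balanced_bootstrap(labels, n_per_cluster, rng)


# Hypothetical imbalanced data: many sparsely populated cells, few dense ones.
rng = random.Random(0)
sparse = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
dense = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5)]
points = sparse + dense

labels = kmeans_pp(points, k=2, rng=rng)
idx = tree_training_indices(labels, n_per_cluster=20, rng=rng)
```

Each tree would then be fitted on the rows listed in `idx`. Because every cluster contributes at least `n_per_cluster` rows, a small cluster is much less likely to be missing from a tree's training sample than under a plain bootstrap.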
