Barua Limon, Zou Bo, Zhou Yan, Liu Yulin
Department of Civil, Materials, and Environmental Engineering, University of Illinois Chicago, Chicago, USA.
Department of Civil and Environmental Engineering, University of California, Berkeley, USA.
Transportation (Amst). 2023;50(2):437-476. doi: 10.1007/s11116-021-10250-z. Epub 2021 Dec 2.
Despite the rapid growth of online shopping and research interest in the relationship between online and in-store shopping, national-level modeling of online shopping demand with a prediction focus remains limited in the literature. This paper leverages two recent releases of the U.S. National Household Travel Survey (NHTS), for 2009 and 2017, to develop machine learning (ML) models, specifically gradient boosting machines (GBM), for predicting household-level online shopping purchases. The NHTS data allow investigation not only nationwide but also at the household level, which is more appropriate than the individual level given the connected consumption and shopping needs of the members of a household. We follow a systematic procedure for model development, including the use of the Recursive Feature Elimination algorithm to select input variables (features) in order to reduce the risk of model overfitting and increase model explainability. Among the several ML models considered, GBM is found to yield the best prediction accuracy. Extensive post-modeling investigation is conducted comparatively between 2009 and 2017, including quantifying the importance of each input variable in predicting online shopping demand and characterizing value-dependent relationships between demand and the input variables. In doing so, two recent advances in machine learning techniques, namely Shapley value-based feature importance and Accumulated Local Effects plots, are adopted to overcome inherent drawbacks of the techniques popular in current ML modeling. The modeling and investigation are performed at the national level, and a number of findings are obtained. The models developed and insights gained can be used for online shopping-related freight demand generation and may also be considered for evaluating the potential impact of relevant policies on online shopping demand.
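To illustrate the idea behind Shapley value-based feature importance mentioned in the abstract, the following is a minimal sketch that computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is not the paper's implementation: the function `toy_model`, its three inputs (`income`, `hh_size`, `urban`), the coefficients, and the baseline values are all hypothetical, chosen only to keep the arithmetic easy to follow. Features outside a coalition are assumed to take a fixed baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Assumption: a feature that is "absent" from a coalition is replaced
    by its baseline value. Cost grows as 2^n, so this is only practical
    for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        # Build an input that uses x for coalition members, baseline otherwise.
        z = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical stand-in for a fitted model (NOT the paper's GBM).
    income, hh_size, urban = z
    return 2.0 * income + 1.0 * hh_size + 0.5 * income * urban


x = [3.0, 4.0, 1.0]         # hypothetical household to explain
baseline = [1.0, 2.0, 0.0]  # hypothetical reference household

phi = shapley_values(toy_model, x, baseline)
print([round(p, 6) for p in phi])
```

For this toy setup the contributions sum to the difference between the prediction for `x` and the prediction for `baseline` (11.5 - 4.0 = 7.5), which is the efficiency property that makes Shapley values attractive for attributing a prediction to its inputs. Exact enumeration is exponential in the number of features, so practical tools rely on approximations or model-specific algorithms instead.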