Zhang Peng, Zhu Bifan, Chen Xing, Wang Linan
School of Humanities, Shanghai Institute of Technology, Haiquan Road 100, Fengxian District, Shanghai, 201418, China.
Shanghai Health Development Research Center, Room 804, No 1477, West Beijing Road, Jing'an District, Shanghai, 200040, China.
Sci Rep. 2025 May 8;15(1):16006. doi: 10.1038/s41598-025-99546-z.
The rapid global rise in healthcare spending is driven in large part by high-need, high-cost (HNHC) patients, the top 5% of patients by annual healthcare costs, who account for over half of total expenditures. Programs targeting existing HNHC patients have shown limited long-term impact, and research on predicting HNHC pediatric patients in China is scarce. There is an urgent need for a specific, valid, and reliable machine-learning-based prediction model that identifies potential HNHC pediatric patients so that proactive interventions can be implemented before high costs arise. This study used a 7-year retrospective cohort dataset from two administrative databases in Shanghai, covering pediatric patients under 18 years of age. Machine-learning models were developed to predict HNHC status using logistic regression, k-nearest neighbors (KNN), random forest (RF), multi-layer perceptron (MLP), and naive Bayes. The 2021-2022 data were split 70:30 into a training set and a test set, with the Synthetic Minority Over-sampling Technique (SMOTE) applied as the internal class-balancing approach. Hyperparameters were optimized by grid search with k-fold cross-validation. Model performance was assessed with five metrics: area under the receiver operating characteristic curve (ROC-AUC), accuracy, sensitivity, specificity, and F1 score. The robustness of the trained models was assessed by external validation on 2022-2023 data and by internal validation with different train-test ratios (80:20 and 90:10). Among the 91,882 hospitalized children included in 2021, the HNHC and non-HNHC groups differed significantly in socioeconomic characteristics, disease, healthcare service utilization, previous healthcare expenditure, and hospital characteristics. Hospitalization costs for HNHC pediatric patients accounted for over 35% of total spending.
The MLP model showed the highest predictive performance (ROC-AUC: 0.872), followed by RF (0.869), KNN (0.836), and naive Bayes (0.828). The most important predictors were length of stay, number of hospitalizations, previous HNHC status, age, and the presence of Top 20 HNHC diseases. MLP proved robust, remaining the top-performing model in external validation (ROC-AUC: 0.843) and in internal validation with different train-test ratios (ROC-AUC: 0.826 at 80:20; 0.807 at 90:10). Machine learning models, particularly MLP, can effectively predict HNHC pediatric patients, providing a basis for early identification of HNHC patients and for introducing proactive healthcare interventions into clinical practice. This approach can also help policymakers and payers optimize healthcare resource allocation, control healthcare costs, and improve patient outcomes.
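For readers unfamiliar with the five evaluation metrics named in the abstract, the following is a minimal sketch of how they can be computed from binary labels and predicted probabilities. It is not the authors' code: the function names `roc_auc` and `evaluate`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration and do not come from the study data.

```python
from typing import Dict, List


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def evaluate(y_true: List[int], scores: List[float],
             threshold: float = 0.5) -> Dict[str, float]:
    """Compute the five metrics; the 0.5 threshold is an assumed default."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "roc_auc": roc_auc(y_true, scores),
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = HNHC, 0 = non-HNHC, imbalanced on purpose.
    labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    probs = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
    for name, value in evaluate(labels, probs).items():
        print(f"{name}: {value:.3f}")
```

With an imbalanced outcome such as HNHC status, accuracy alone can look high even when few HNHC patients are found, which is one reason to read it alongside sensitivity, specificity, F1, and ROC-AUC. ROC-AUC is the only one of the five that does not depend on the chosen threshold.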