Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage.

Affiliations

Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China.

Department of Neurosurgery, West China Hospital of Sichuan University, Chengdu, Sichuan, People's Republic of China.

Publication information

BMC Med Inform Decis Mak. 2022 Oct 25;22(1):278. doi: 10.1186/s12911-022-02018-x.

Abstract

BACKGROUND

Outliers and class imbalance in medical data can reduce the accuracy of machine learning models. For physicians who want to apply predictive models, deciding how to build a model from the data at hand and which model to choose is difficult. Modeling therefore needs to account for outliers, imbalanced data, model selection, and parameter tuning.

METHODS

This study used a joint modeling strategy consisting of four steps: outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all patients with intracerebral hemorrhage (ICH) admitted in 2017-2019 in Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality outcomes 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy: training sets with and without the cross-validated committees filter (CVCF); six data balancing options, namely no resampling and five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline synthetic minority oversampling technique (Borderline SMOTE), and synthetic minority oversampling technique combined with edited nearest neighbor (SMOTEENN)); and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
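The count of 84 follows from crossing the three choices listed in the Methods: 2 outlier-handling options × 6 data balancing options × 7 models. The minimal sketch below enumerates that grid. The label strings and the function name `enumerate_strategies` are illustrative choices made here, not identifiers from the study's code.

```python
from itertools import product

# Illustrative labels that mirror the abbreviations used in the abstract.
OUTLIER_OPTIONS = ["no CVCF", "CVCF"]
BALANCING_OPTIONS = ["none", "RUS", "ROS", "ADASYN", "Borderline SMOTE", "SMOTEENN"]
MODELS = ["LR", "RF", "ANN", "SVM", "KNN", "Stacking", "AdaBoost"]


def enumerate_strategies():
    """Return every (outlier handling, balancing, model) combination."""
    return list(product(OUTLIER_OPTIONS, BALANCING_OPTIONS, MODELS))


strategies = enumerate_strategies()
print(len(strategies))  # 2 * 6 * 7 = 84
```

Each tuple, for example `("CVCF", "SMOTEENN", "Stacking")`, names one complete pipeline configuration to be fitted and evaluated.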

RESULTS

Of 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days after discharge. Removing outliers with CVCF improved the performance of all models on every metric except sensitivity. For data balancing, training sets without resampling outperformed resampled training sets in accuracy, specificity, and precision, while ROS gave the best AUC. Across the seven models, RF had the highest average accuracy, specificity, AUC, and precision, and Stacking had the best F1 score. Among all 84 combinations of the joint modeling strategy, eight tied for the best accuracy (0.816). The best sensitivity was achieved by SMOTEENN + Stacking (0.662), the best specificity by CVCF + KNN (0.987), and the best precision by CVCF + SVM (0.938). Stacking and AdaBoost had the best AUC (0.756) and F1 score (0.602), respectively.
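For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, specificity, precision, and F1 score are derived from a binary confusion matrix. It assumes death within 90 days is the positive class, and the counts in `example` are invented purely for illustration; they are not results from the study. AUC is omitted because it requires predicted scores rather than a single confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # died, predicted died
    fn: int  # died, predicted survived
    tn: int  # survived, predicted survived
    fp: int  # survived, predicted died


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(cm):
    """Compute the threshold-based metrics named in the abstract."""
    sensitivity = _ratio(cm.tp, cm.tp + cm.fn)  # recall on the positive class
    specificity = _ratio(cm.tn, cm.tn + cm.fp)  # recall on the negative class
    precision = _ratio(cm.tp, cm.tp + cm.fp)
    accuracy = _ratio(cm.tp + cm.tn, cm.tp + cm.fn + cm.tn + cm.fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, NOT taken from the study.
example = ConfusionMatrix(tp=60, fn=40, tn=180, fp=20)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The example illustrates why a model can score high on specificity and precision while missing many deaths: sensitivity depends only on the positive-class row of the matrix.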

CONCLUSION

This study proposed a joint modeling strategy comprising outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, as a reference for physicians and researchers who want to build their own models. It illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Because the imbalance ratio (IR, the ratio of the majority class to the minority class) in this study was low, resampling did not improve accuracy, specificity, or precision, although ROS performed best on AUC.
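As a rough illustration of the imbalance ratio and of random over-sampling, the sketch below computes IR from the cohort counts reported in the Results (2909 survivors, 1298 deaths) and then duplicates minority-class rows at random until the classes are equal in size. The label coding (1 = died), the placeholder row indices, and the function names are assumptions made here. The study applied resampling to training sets, whereas this sketch uses the whole-cohort counts only to show the arithmetic.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Cohort counts from the abstract; 0 = survived, 1 = died (assumed coding).
labels = [0] * 2909 + [1] * 1298
rows = list(range(len(labels)))  # placeholder row identifiers

print(round(imbalance_ratio(labels), 2))  # 2909 / 1298, about 2.24

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

An IR of about 2.24 is mild compared with the heavily skewed datasets for which resampling methods are usually motivated, which is consistent with the authors' explanation for the limited benefit observed.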

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fd5/9594939/c6babf71af52/12911_2022_2018_Fig1_HTML.jpg
