College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, PR China; College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, PR China.
College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, PR China.
Comput Methods Programs Biomed. 2022 Nov;226:107088. doi: 10.1016/j.cmpb.2022.107088. Epub 2022 Aug 28.
Traditional hypertension and hyperlipidemia prediction models suffer from inconsistent modeling data sources, small sample sizes, and the lack of a uniform standard for the indicator system, so they fall short of what clinical application requires. To address this issue, this work proposes DHDIP, an interpretable hypertension and hyperlipidemia prediction model based on electronic medical record (EMR) data.
First, we select massive, high-dimensional, unstructured EMR data as a unified modeling data source and propose a pre-processing algorithm for it, since raw EMR data cannot be processed directly by machine learning algorithms. Second, several mainstream models, including XGBoost, CatBoost, and RandomForest, are used for modeling, and the best-suited algorithm is identified by performance comparison. Finally, the SHAP framework is introduced into the DHDIP model to identify the main factors contributing to hypertension and hyperlipidemia, which improves the interpretability of the model.
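The attribution idea behind SHAP can be illustrated with a minimal sketch that computes exact Shapley values for a toy predictor. Everything here is an assumption made for illustration: `toy_risk_model`, its three inputs and weights, and the patient and reference vectors are invented, and the brute-force enumeration stands in for the optimized estimators that the SHAP framework provides for tree models. It is not the DHDIP implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value.
    The cost grows as 2**n, so this is suitable for illustration only.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return model(mixed)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk_model(features):
    # Hypothetical stand-in for a trained predictor; inputs and weights are invented.
    age, bmi, glucose = features
    return 0.01 * age + 0.02 * bmi + 0.05 * glucose + 0.001 * bmi * glucose


patient = [60.0, 30.0, 7.0]    # invented example record
reference = [45.0, 24.0, 5.0]  # invented baseline record
contributions = shapley_values(toy_risk_model, patient, reference)
```

The contributions sum to the difference between the prediction for `patient` and the prediction for `reference`, which is the property that lets per-feature attributions be read as a decomposition of a single prediction.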
The DHDIP model achieves an MSE of 0.0285 and a LOSS value of 0.0054, both better than the values reported in previous studies.
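For reference, mean squared error can be computed as in the sketch below. The labels and predictions are invented, and averaging the per-outcome MSE into one figure is an assumption: the abstract does not state how the reported value is aggregated across the two outcomes, nor how the LOSS value is defined.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels and predictions for the two outcomes.
hypertension_true = [1, 0, 1, 0]
hypertension_pred = [0.9, 0.2, 0.8, 0.1]
hyperlipidemia_true = [0, 0, 1, 1]
hyperlipidemia_pred = [0.1, 0.3, 0.7, 0.6]

per_target = {
    "hypertension": mean_squared_error(hypertension_true, hypertension_pred),
    "hyperlipidemia": mean_squared_error(hyperlipidemia_true, hyperlipidemia_pred),
}

# Assumption: one overall figure is the unweighted mean over the two outcomes.
overall = sum(per_target.values()) / len(per_target)
```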
The model balances performance and interpretability. Multi-objective learning allows a more thorough analysis and prediction of a patient's condition, which lowers the cost of disease prediction and supports physicians in clinical diagnosis. The datasets and source code are available at https://github.com/Xiaoyao-Jia/DHDIP.