DHDIP: An interpretable model for hypertension and hyperlipidemia prediction based on EMR data

Affiliations

College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, PR China; College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, PR China.

College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, PR China.

Publication information

Comput Methods Programs Biomed. 2022 Nov;226:107088. doi: 10.1016/j.cmpb.2022.107088. Epub 2022 Aug 28.

DOI:10.1016/j.cmpb.2022.107088

PMID:36096022

Abstract

BACKGROUND AND OBJECTIVE

Traditional hypertension and hyperlipidemia prediction models suffer from inconsistent modeling data sources, small sample sizes, and the lack of a uniform standard for the index system, so they fall short of the requirements of clinical application. To address this issue, this work proposes DHDIP, an interpretable hypertension and hyperlipidemia prediction model based on electronic medical record (EMR) data.

METHODS

First, we select massive, high-dimensional, unstructured EMR data as a unified modeling data source and propose a pre-processing algorithm for EMR data, since raw EMR data cannot be processed directly by machine learning algorithms. Second, we build models with several mainstream algorithms, including XGBoost, CatBoost, and RandomForest, and identify the best-suited algorithm by performance comparison. Finally, we introduce the SHAP framework into the DHDIP model to identify the main factors contributing to hypertension and hyperlipidemia, which enhances the interpretability of the model.
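To make the interpretability step concrete, the following is a minimal sketch of the quantity that SHAP attributes to each feature: the Shapley value, computed here by brute force for a single record. It is not the authors' code and does not use the SHAP library or a trained tree model. The feature names, patient values, baseline values, and `toy_risk_model` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names and values for one patient (not from the DHDIP dataset).
FEATURES = ["age", "bmi", "systolic_bp"]
PATIENT = {"age": 62.0, "bmi": 31.0, "systolic_bp": 150.0}
# Hypothetical reference ("background") values used when a feature is treated as absent.
BASELINE = {"age": 45.0, "bmi": 24.0, "systolic_bp": 120.0}


def toy_risk_model(x):
    """Stand-in for a trained predictor: a linear score plus one interaction term."""
    score = 0.01 * x["age"] + 0.02 * x["bmi"] + 0.005 * x["systolic_bp"]
    if x["bmi"] > 30 and x["systolic_bp"] > 140:
        score += 0.2
    return score


def value(coalition):
    """Model output when only the features in `coalition` take the patient's values."""
    x = {f: (PATIENT[f] if f in coalition else BASELINE[f]) for f in FEATURES}
    return toy_risk_model(x)


def shapley_values():
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


phi = shapley_values()
# Rank features by the magnitude of their contribution to this prediction.
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.4f}")
```

The attributions sum to the difference between the model output for the patient and the output for the baseline record, which is the additivity property that SHAP explanations rely on. Exhaustive enumeration scales exponentially with the number of features, so it is only practical for toy cases like this one.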

RESULTS

The DHDIP model achieves an MSE of 0.0285 and a LOSS value of 0.0054, both of which improve on the values reported in previous studies.
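For reference, the mean squared error (MSE) is conventionally defined as follows. The symbols are introduced here for illustration and are not taken from the abstract: $y_i$ is the observed value for sample $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of evaluated samples. The abstract does not state which loss function the LOSS value refers to, so no definition is given for it.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```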

CONCLUSION

The model balances performance and interpretability. Multi-objective learning allows a more thorough analysis and prediction of the condition, which both lowers the cost of disease prediction and aids physicians in clinical diagnosis. The datasets and source code are available at https://github.com/Xiaoyao-Jia/DHDIP.
