Application of Interpretable Machine Learning Models to Predict the Risk Factors of HBV-Related Liver Cirrhosis in CHB Patients Based on Routine Clinical Data: A Retrospective Cohort Study.

Authors

Xia Wei, Tan Yafeng, Mei Bing, Zhou Yizheng, Tan Jufang, Pubu Zhaxi, Sang Bu, Jiang Tao

Affiliations

Department of Laboratory Medicine, Jingzhou Hospital Affiliated to Yangtze University, Jingzhou, Hubei, People's Republic of China.

Center for Scientific Research and Medical Transformation, Jingzhou Hospital Affiliated to Yangtze University, Hubei, People's Republic of China.

Publication information

J Med Virol. 2025 Mar;97(3):e70302. doi: 10.1002/jmv.70302.

DOI:10.1002/jmv.70302

PMID:40105097

Abstract

Chronic hepatitis B (CHB) infection represents a significant global public health issue, often leading to hepatitis B virus (HBV)-related liver cirrhosis (HBV-LC) with poor prognoses. Early identification of HBV-LC risk is essential for timely intervention. This study develops and compares nine machine learning (ML) models to predict HBV-LC risk in CHB patients using routine clinical and laboratory data. A retrospective analysis was conducted involving 777 CHB patients, with 50.45% (392/777) progressing to HBV-LC. Admission data consisted of 52 clinical and laboratory variables, with missing values addressed using multiple imputation. Feature selection utilized Least Absolute Shrinkage and Selection Operator (LASSO) regression and the Boruta algorithm, identifying 24 key variables. The evaluated ML models included XGBoost, logistic regression (LR), LightGBM, random forest (RF), AdaBoost, Gaussian naive Bayes (GNB), multilayer perceptron (MLP), support vector machine (SVM), and k-nearest neighbors (KNN). The data set was partitioned into an 80% training set (n = 621) and a 20% independent testing set (n = 156). Cross-validation (CV) facilitated hyperparameter tuning and internal validation of the optimal model. Performance metrics included the area under the receiver operating characteristic curve (AUC), Brier score, accuracy, sensitivity, specificity, and F1 score. The RF model demonstrated superior performance, with AUCs of 0.992 (training) and 0.907 (validation), while the reconstructed model achieved AUCs of 0.944 (training) and 0.945 (validation), maintaining an AUC of 0.863 in the testing set. Calibration curves confirmed a strong alignment between observed and predicted probabilities. Decision curve analysis indicated that the RF model provided the highest net benefit across threshold probabilities. The SHAP algorithm identified RPR, PLT, HBV DNA, ALT, and TBA as critical predictors. This interpretable ML model enhances early HBV-LC prediction and supports clinical decision-making in resource-limited settings.
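
The performance metrics named in the abstract (AUC, Brier score, accuracy, sensitivity, specificity, and F1 score) can be made concrete with a small pure-Python sketch. The labels and predicted probabilities below are invented toy values, not study data. The function names (`roc_auc`, `brier_score`, `threshold_metrics`) and the 0.5 decision threshold are assumptions for illustration, since the abstract does not report the software or cutoff used.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher probability; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome
    (lower is better; 0 is a perfect forecast)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def threshold_metrics(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and F1 at an assumed cutoff."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, h in pairs if y == 1 and h == 1)
    tn = sum(1 for y, h in pairs if y == 0 and h == 0)
    fp = sum(1 for y, h in pairs if y == 0 and h == 1)
    fn = sum(1 for y, h in pairs if y == 1 and h == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Toy example (hypothetical values): 1 = HBV-LC, 0 = CHB without cirrhosis.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_auc = roc_auc(toy_labels, toy_probs)
toy_brier = brier_score(toy_labels, toy_probs)
toy_metrics = threshold_metrics(toy_labels, toy_probs)

print(f"AUC={toy_auc:.3f}  Brier={toy_brier:.3f}")
print({name: round(value, 3) for name, value in toy_metrics.items()})
```

AUC summarizes ranking quality independently of any cutoff, whereas the Brier score also penalizes poorly calibrated probabilities, which is why the two are commonly reported together. Library routines (for example in scikit-learn) compute the same quantities; the hand-written versions here only make the definitions explicit.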
