Current-Visit and Next-Visit Prediction for Fatty Liver Disease With a Large-Scale Dataset: Model Development and Performance Comparison.

Authors

Wu Cheng-Tse, Chu Ta-Wei, Jang Jyh-Shing Roger

Affiliations

Department of Computer Science & Information Engineering, National Taiwan University, Taipei, Taiwan.

Department of Obstetrics and Gynecology, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan.

Publication information

JMIR Med Inform. 2021 Aug 12;9(8):e26398. doi: 10.2196/26398.

PMID:34387552

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8391752/

Abstract

BACKGROUND

Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation, which, if not well controlled, may develop into liver fibrosis, cirrhosis, or even hepatocellular carcinoma.

OBJECTIVE

We describe the construction of machine-learning models for current-visit prediction (CVP), which can help physicians obtain more information for accurate diagnosis, and next-visit prediction (NVP), which can help physicians provide potential high-risk patients with advice to effectively prevent FLD.

METHODS

The large-scale, high-dimensional dataset used in this study comes from the Taipei MJ Health Research Foundation in Taiwan. We used one-pass ranking and sequential forward selection (SFS) for feature selection in FLD prediction. For CVP, we explored multiple models, including k-nearest-neighbor classifier (KNNC), Adaboost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian naïve Bayes (GNB), C4.5 decision trees (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) and several of its variants as sequence classifiers that take various input sets for prediction. Model performance was evaluated on two criteria: accuracy on the test set, and the intersection over union/coverage between the features selected by one-pass ranking/SFS and those selected by domain experts. Accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve were calculated for both CVP and NVP, separately for males and females.
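As an illustration of the feature-selection comparison described above, the following is a minimal sketch, not the authors' implementation. It shows a generic greedy SFS loop and two set-overlap measures. The feature names, the toy scoring function, and the definition of coverage (the share of expert-selected features that the automatic method also selects) are assumptions made for this example; in practice the score would be a validation accuracy from one of the classifiers.

```python
from typing import Callable, Iterable, List, Sequence


def sequential_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    max_features: int,
) -> List[str]:
    """Greedy SFS: add the single best feature per round until no gain."""
    selected: List[str] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Score every one-feature extension of the current subset.
        trial = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial, key=trial.get)
        if trial[best_feature] <= best_score:
            break  # no candidate improves the score; stop early
        selected.append(best_feature)
        best_score = trial[best_feature]
    return selected


def intersection_over_union(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B| for two feature sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def coverage(selected: Iterable[str], expert: Iterable[str]) -> float:
    """Assumed definition: fraction of expert features that were selected."""
    ss, se = set(selected), set(expert)
    return len(ss & se) / len(se) if se else 0.0


# Hypothetical per-feature usefulness; stands in for a real model score.
toy_weights = {"bmi": 0.5, "alt": 0.3, "age": 0.1, "height": 0.01}


def toy_score(features: List[str]) -> float:
    # Sum of weights minus a small penalty per feature (illustrative only).
    return sum(toy_weights[f] for f in features) - 0.05 * len(features)


selected = sequential_forward_selection(list(toy_weights), toy_score, 4)
expert = ["bmi", "alt", "triglycerides"]  # hypothetical expert choice
iou = intersection_over_union(selected, expert)
cov = coverage(selected, expert)
```

In this toy setup the loop stops once adding another feature no longer raises the score, so the selected subset can be smaller than `max_features`. The two overlap measures then quantify how closely the automatic selection agrees with the expert list.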

RESULTS

After data cleaning, the dataset included 34,856 unique visits for males and 31,394 for females over the period 2009-2016. The test accuracy of CVP using KNNC, Adaboost, SVM, LR, RF, GNB, C4.5, and CART was 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%, respectively. The test accuracy of NVP using LSTM, bidirectional LSTM (biLSTM), Stack-LSTM, Stack-biLSTM, and Attention-LSTM was 76.54%, 76.66%, 77.23%, 76.84%, and 77.31%, respectively, for fixed-interval features, and 79.29%, 79.12%, 79.32%, 79.29%, and 78.36%, respectively, for variable-interval features.

CONCLUSIONS

This study explored a large-scale FLD dataset with high dimensionality. We developed FLD prediction models for CVP and NVP. We also implemented efficient feature selection schemes for current- and next-visit prediction to compare the automatically selected features with expert-selected features. In particular, NVP emerged as more valuable from the viewpoint of preventive medicine. For NVP, we propose using feature set 2 (with variable intervals), which is more compact and flexible. We also tested several variants of LSTM in combination with two feature sets to identify the best match for male and female FLD prediction. More specifically, the best model for males was Stack-LSTM using feature set 2 (with 79.32% accuracy), whereas the best model for females was LSTM using feature set 1 (with 81.90% accuracy).
