Atsawarungruangkit Amporn, Laoveeravat Passisd, Promrat Kittichai
Division of Gastroenterology, Warren Alpert Medical School, Brown University, Providence, RI 02903, United States.
Division of Digestive Diseases and Nutrition, University of Kentucky College of Medicine, Lexington, KY 40536, United States.
World J Hepatol. 2021 Oct 27;13(10):1417-1427. doi: 10.4254/wjh.v13.i10.1417.
Non-alcoholic fatty liver disease (NAFLD) is the most common chronic liver disease, affecting over 30% of the United States population. Early patient identification using a simple method is highly desirable.
To create machine learning models for predicting NAFLD in the general United States population.
We used data from the National Health and Nutrition Examination Survey (NHANES) 1988-1994. Thirty NAFLD-related factors were included. The dataset was divided into training (70%) and testing (30%) datasets. Twenty-four machine learning algorithms were applied to the training dataset. The best-performing models and one additional interpretable model (i.e., coarse trees) were then evaluated on the testing dataset.
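As an illustration of the 70%/30% division described above, the following is a minimal sketch of a class-stratified split. The abstract does not say whether the split was stratified or which software performed it, so the function name `stratified_split`, the seed, and the toy labels are assumptions made only for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in each part.

    Hypothetical helper: stratification is an assumption, not a detail
    reported in the abstract.
    """
    rng = random.Random(seed)

    # Group row indices by class label (e.g. 1 = NAFLD, 0 = no NAFLD).
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels for illustration only; these are not NHANES data.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the proportion of positive cases roughly equal in both parts, which matters when the outcome is imbalanced.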
A total of 3235 participants met the inclusion criteria. In the training phase, the ensemble of random undersampling (RUS) boosted trees had the highest F1 (0.53). In the testing phase, we compared selected machine learning models with NAFLD indices. Based on F1, the ensemble of RUS boosted trees remained the top performer (accuracy 71.1% and F1 0.56), followed by the fatty liver index (accuracy 68.8% and F1 0.52). A simple model (coarse trees) had an accuracy of 74.9% and an F1 of 0.33.
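The results above show that a model can have the higher accuracy and still the lower F1. The sketch below shows how both metrics are computed from a confusion matrix and why they can diverge when positive cases are the minority. The counts are toy values chosen for illustration, not figures from the study, and the function names are assumptions.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy confusion matrix (NOT study data): 45 true positives in the
# population, of which the model finds only 10.
tp, fp, fn, tn = 10, 5, 35, 150

acc = accuracy(tp, fp, fn, tn)   # dominated by the many true negatives
f1 = f1_score(tp, fp, fn)        # penalised by the missed positives
```

Because F1 ignores true negatives, a model that rarely predicts the positive class can look accurate overall while scoring poorly on F1.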
Not every machine learning model is complex. Using a simpler model such as coarse trees, we can create an interpretable model for predicting NAFLD with only two predictors: fasting C-peptide and waist circumference. Although the simpler model does not have the best performance, its simplicity is useful in clinical practice.
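To make concrete what an interpretable two-predictor tree looks like, here is a minimal sketch. The abstract names the two predictors but gives neither the cutoffs nor the order of the splits, so the tree shape, the class name `CoarseTree`, and the numeric cutoffs below are all hypothetical placeholders and must not be read as clinical thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarseTree:
    """Two-split decision rule; both cutoffs must be supplied by the user.

    Hypothetical structure: the split order and cutoffs are assumptions,
    not values reported in the abstract.
    """
    c_peptide_cutoff: float
    waist_cutoff: float

    def predict(self, fasting_c_peptide, waist_circumference):
        """Return True when the rule flags likely NAFLD."""
        # First split: low fasting C-peptide is classified as negative.
        if fasting_c_peptide <= self.c_peptide_cutoff:
            return False
        # Second split: among the rest, a large waist is classified positive.
        return waist_circumference > self.waist_cutoff


# Placeholder numbers purely to exercise the code.
tree = CoarseTree(c_peptide_cutoff=1.0, waist_cutoff=100.0)
flagged = tree.predict(fasting_c_peptide=1.5, waist_circumference=110.0)
```

A rule of this size can be read and applied without software, which is the practical appeal of the simpler model.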