Department of Biostatistics, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Basic and Molecular Epidemiology of Gastrointestinal Disorders Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Arch Iran Med. 2024 Oct 1;27(10):551-562. doi: 10.34172/aim.31269.
Metabolic dysfunction-associated steatotic liver disease (MASLD) represents a significant global health burden without established curative therapies. Early detection and preventive strategies are crucial for effective MASLD management. This study aimed to develop and validate machine-learning (ML) algorithms for accurate MASLD screening in a geographically diverse, large-scale population.
Data came from the prospective Fasa Cohort Study, initiated in rural Fars province, Iran, in March 2014, and were collected through blood tests, questionnaires, liver ultrasonography, and physical examinations. A two-step approach identified key predictors from over 100 variables: (1) statistical selection using mean decrease Gini in random forest and (2) incorporation of clinical expertise for alignment with known MASLD risk factors. Hold-out validation with a 70/30 train/validation split was used, along with 5-fold cross-validation on the validation set. Logistic regression, Naïve Bayes, support vector machine, and light gradient-boosting machine (LightGBM) algorithms were built on the same input variables and compared on area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy.
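The comparison metrics listed above can be illustrated with a minimal sketch. It is not the study's code: the function names `screening_metrics` and `auc_rank`, the 0.5 decision threshold, and the eight toy labels and scores are all assumptions made for illustration. AUC is computed here by the rank (Mann-Whitney) formulation, which may differ from the method the authors used.

```python
from typing import Dict, Sequence


def screening_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV and accuracy from binary labels (1 = MASLD)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positives among actual cases
        "specificity": ratio(tn, tn + fp),  # true negatives among actual non-cases
        "ppv": ratio(tp, tp + fp),          # actual cases among predicted positives
        "npv": ratio(tn, tn + fn),          # actual non-cases among predicted negatives
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


def auc_rank(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the Fasa cohort).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = screening_metrics(y_true, y_pred)
auc = auc_rank(y_true, scores)
```

Sensitivity and specificity do not depend on how common the condition is in the evaluated sample, whereas PPV and NPV do, which is one reason to report all of them together when assessing a screening tool.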
A total of 6180 adults (52.7% female) were included in the study: 4816 without MASLD and 1364 with MASLD, with mean ages (±standard deviation [SD]) of 48.12 (±9.61) and 49.47 (±9.15) years, respectively. Logistic regression outperformed the other ML algorithms, achieving an accuracy of 0.88 (95% confidence interval [CI]: 0.86-0.89) and an AUC of 0.92 (95% CI: 0.90-0.93). Among more than 100 variables, the key predictors included waist circumference, body mass index (BMI), hip circumference, wrist circumference, alanine aminotransferase levels, cholesterol, glucose, high-density lipoprotein, and blood pressure.
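As a rough arithmetic check on these figures, the sketch below computes the share of MASLD cases in the cohort and a normal-approximation (Wald) interval for the reported accuracy. Two things are assumptions rather than facts from the abstract: that the validation set is 30% of the 6180 participants, and that a Wald interval is a reasonable stand-in for whatever interval method the authors used. The helper name `wald_ci` is hypothetical.

```python
import math

n_total = 6180
n_masld = 1364
n_non_masld = 4816

# The two groups reported in the abstract add up to the cohort size.
groups_sum = n_masld + n_non_masld

# Share of participants with MASLD in the analysed cohort.
prevalence = n_masld / n_total


def wald_ci(p: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% interval for a proportion p observed on n samples."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


# Assumption: the 30% validation split corresponds to about 1854 participants.
n_validation = round(0.30 * n_total)
ci_low, ci_high = wald_ci(0.88, n_validation)
```

Under these assumptions the interval is close to the reported 0.86-0.89, with small differences expected from rounding of the accuracy and from the unknown interval method. With roughly 22% of participants in the MASLD group, accuracy alone can look favourable for a model that mostly predicts the majority class, so the AUC and the predictive values carry additional information.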
Integration of ML in MASLD management holds significant promise, particularly in resource-limited rural settings. Additionally, the relative importance assigned to each predictor, particularly prominent contributors such as waist circumference and BMI, offers valuable insights into MASLD prevention, diagnosis, and treatment strategies.