The effect of data balancing approaches on the prediction of metabolic syndrome using non-invasive parameters based on random forest.

Affiliations

School of Public Health, Bam University of Medical Sciences, Bam, Iran.

Research Center for Food Hygiene and Safety, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran.

Publication information

BMC Bioinformatics. 2024 Jan 11;25(1):18. doi: 10.1186/s12859-024-05633-9.

DOI:10.1186/s12859-024-05633-9

PMID:38212697

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10782700/

Abstract

BACKGROUND

Metabolic syndrome (MetS) is a cluster of metabolic abnormalities (including obesity, insulin resistance, hypertension, and dyslipidemia) that can be used to identify populations at risk for diabetes and cardiovascular diseases, the main causes of morbidity and mortality worldwide. A simple approach for diagnosing MetS without biochemical tests would therefore be highly valuable. The present study aimed to predict MetS from non-invasive features using the random forest learning algorithm. In addition, because this type of data is naturally imbalanced, the study investigated the effect on model performance of two data balancing approaches: the Synthetic Minority Over-sampling Technique (SMOTE) and Random Splitting data balancing (SplitBal).
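The two balancing ideas named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's implementation: the function names `smote_like_sample` and `split_balance`, the parameters `k`, `n_new`, and `seed`, and the toy rows (waist circumference in cm, systolic blood pressure in mmHg) are all assumptions made for this example. The first function interpolates between a minority sample and one of its nearest minority neighbors, which is the core idea of SMOTE; the second randomly partitions the majority class into bins of roughly the minority size and pairs each bin with the full minority class, which is the splitting step usually associated with SplitBal. Training one classifier per balanced subset and combining their outputs is left out.

```python
import random


def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` by squared Euclidean distance
        neighbors = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


def split_balance(majority, minority, seed=0):
    """Randomly split the majority class into bins of about the minority size
    and pair each bin with the full minority class."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    n_bins = max(1, round(len(majority) / len(minority)))
    bins = [shuffled[i::n_bins] for i in range(n_bins)]
    return [(bin_rows, list(minority)) for bin_rows in bins]


# Hypothetical toy data: (waist circumference, systolic blood pressure).
majority = [(78.0 + i, 110.0 + i) for i in range(12)]            # no MetS
minority = [(102.0, 138.0), (105.0, 142.0), (99.0, 135.0), (108.0, 150.0)]  # MetS

synthetic = smote_like_sample(minority)
subsets = split_balance(majority, minority)

print(len(synthetic), "synthetic minority rows")
print(len(subsets), "balanced subsets of sizes",
      [(len(maj), len(mino)) for maj, mino in subsets])
```

Oversampling enlarges the minority class with artificial rows, whereas splitting keeps only observed rows and spreads the majority class across several balanced training sets.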

RESULTS

The most important determinant for MetS prediction was waist circumference. When the random forest learning algorithm was applied to the imbalanced data, the trained models reached accuracies of 86.9% and 79.4% and sensitivities of 37.1% and 38.2% in men and women, respectively. The best results were obtained with the SplitBal data balancing technique: although the accuracy of the trained models decreased by 7.8% and 11.3%, their sensitivity improved significantly to 82.3% and 73.7% in men and women, respectively.
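The pattern in these results, high accuracy alongside low sensitivity on imbalanced data, follows directly from how the two metrics are defined. The sketch below uses hypothetical confusion-matrix counts chosen only for illustration (1000 people, 150 of them with MetS); they are not the study's data, and the variable names are assumptions.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Share of true MetS cases that the model detects (recall)."""
    return tp / (tp + fn)


# Hypothetical counts, NOT taken from the paper.
# Model trained on imbalanced data: it mostly predicts "no MetS".
imbalanced = {"tp": 55, "fn": 95, "tn": 820, "fp": 30}
# Model trained on balanced data: more cases found, more false alarms.
balanced = {"tp": 120, "fn": 30, "tn": 670, "fp": 180}

for name, cm in (("imbalanced", imbalanced), ("balanced", balanced)):
    acc = accuracy(**cm)
    sens = sensitivity(cm["tp"], cm["fn"])
    print(f"{name}: accuracy={acc:.3f}, sensitivity={sens:.3f}")
```

With these counts the first model is right in 87.5% of cases while detecting only about 37% of true cases, because the large healthy group dominates the accuracy figure. The second model gives up accuracy (79.0%) but detects 80% of true cases, which is usually the more relevant property for a screening tool.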

CONCLUSIONS

The random forest learning method, along with data balancing techniques, especially SplitBal, could create MetS prediction models with promising results that can be applied as a useful prognostic tool in health screening programs.
