Gul Sebnem, Ayturan Kubilay, Hardalaç Fırat
Department of Electrical and Electronics Engineering, Faculty of Engineering, Graduate School of Natural and Applied Sciences, Gazi University, Ankara 06570, Turkey.
J Pers Med. 2024 Jul 29;14(8):804. doi: 10.3390/jpm14080804.
Predicting type 2 diabetes mellitus (T2DM) from phenotypic data with machine learning (ML) techniques has received significant attention in recent years. PyCaret, a low-code automated ML tool that enables the simultaneous application of 16 different algorithms, was used to predict T2DM from phenotypic variables in the "Nurses' Health Study" and "Health Professionals' Follow-up Study" datasets. Ridge Classifier, Linear Discriminant Analysis, and Logistic Regression (LR) were the best-performing models for the male-only data subset. For the female-only data subset, LR, Gradient Boosting Classifier, and CatBoost Classifier were the strongest models. The AUC, accuracy, and precision were approximately 0.77, 0.70, and 0.70, respectively, for males and 0.79, 0.70, and 0.71 for females. The feature importance plot showed that family history of diabetes (famdb), never having smoked, and high blood pressure (hbp) were the most influential features in females, while famdb, hbp, and currently being a smoker were the major variables in males. In conclusion, PyCaret simplified complex ML tasks and was used successfully to predict T2DM. Gender differences are important to consider in T2DM prediction. Even with this comprehensive ML tool, phenotypic variables alone may not be sufficient for early T2DM prediction; future studies could combine them with genotypic variables.
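As an illustration of the three metrics reported above (AUC, accuracy, and precision), the following minimal Python sketch computes them for a binary classifier using only the standard library. It is not the authors' PyCaret pipeline: the labels, scores, and the 0.5 decision threshold are hypothetical toy values chosen for the example, and the function names are illustrative.

```python
# Minimal sketch of the metrics named in the abstract.
# ASSUMPTION: toy data and a 0.5 threshold; none of this comes from the study.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """True positives divided by all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = T2DM, 0 = no T2DM) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("AUC:", roc_auc(y_true, scores))
print("accuracy:", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
```

Accuracy and precision depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is one reason the three values can differ as they do in the results above.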