Department of Mathematics, Near East University, Nicosia, 99138, Turkey.
Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran.
BMC Med Inform Decis Mak. 2023 Oct 16;23(1):219. doi: 10.1186/s12911-023-02323-z.
After the World Health Organization declared the COVID-19 pandemic, the role of vitamin D became even more critical for people worldwide. The most accurate way to determine vitamin D level is the 25-hydroxy vitamin D (25-OH-D) blood test. However, this blood test is not always feasible. Data sets used in health science research often contain highly correlated features, a condition referred to as the multicollinearity problem. This problem can lead to misleading results and overfitting in the machine learning (ML) training process. Therefore, this study aims to identify a clinically acceptable ML model that accurately detects the vitamin D status of adult participants in North Cyprus without measuring the 25-OH-D level, while taking the multicollinearity problem into account.
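As an illustration of what "highly correlated features" means in practice, the following is a minimal sketch that flags feature pairs whose absolute Pearson correlation exceeds a cutoff. It is not the authors' procedure: the feature names, the toy values, and the 0.8 threshold are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def flag_collinear_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The 0.8 default is an illustrative assumption, not a value from the study.
    """
    names = list(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical toy data: weight and BMI move together, age does not.
toy_features = {
    "weight_kg": [60, 72, 85, 90, 55, 78],
    "bmi": [21.0, 24.5, 28.9, 30.2, 19.8, 26.1],
    "age": [34, 51, 29, 62, 45, 38],
}

collinear_pairs = flag_collinear_pairs(toy_features)
print(collinear_pairs)
```

On this toy data only the weight and BMI pair is flagged. Pairwise correlation is a coarse screen; a variance inflation factor computed per feature would also catch collinearity that involves more than two features.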
The study was conducted with 481 participants who presented voluntarily to the Internal Medicine Department at NEU Hospital. The classification performance of four conventional supervised ML models was compared: ordinal logistic regression (OLR), elastic-net ordinal regression (ENOR), support vector machine (SVM), and random forest (RF). The comparative analysis covered the models' sensitivity to the participants' metabolic syndrome (MtS) positive status, hyper-parameter tuning, sensitivity to the size of the training data, and the classification performance of the models.
The findings showed that, owing to the presence of multicollinearity, the performance of the SVM (RBF) was clearly degraded when test performance was examined. Moreover, RF was more robust than the other models to variations in the size of the training data. In this experiment, the selected RF and ENOR models performed better than the other two models when the number of training samples was reduced. Since multicollinearity is more severe in small samples, it can be concluded that RF and ENOR are not affected by the multicollinearity problem. The comparative analysis revealed that the RF classifier performed better and was more robust than the other proposed models in terms of accuracy (0.94), specificity (0.96), sensitivity or recall (0.94), precision (0.95), F1-score (0.95), and Cohen's kappa (0.90).
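For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, sensitivity (recall), specificity, precision, F1-score, and Cohen's kappa can be derived from a multi-class confusion matrix. The class labels, the toy predictions, and the use of macro averaging across classes are assumptions for illustration; the abstract does not state how the per-class values were averaged.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


def summarize(cm):
    """Accuracy, macro-averaged per-class metrics, and Cohen's kappa."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    accuracy = sum(cm[i][i] for i in range(k)) / total

    recalls, precisions, specificities, f1s = [], [], [], []
    for c in range(k):
        # One-vs-rest counts for class c.
        tp = cm[c][c]
        fn = row_tot[c] - tp
        fp = col_tot[c] - tp
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        precisions.append(precision)
        specificities.append(specificity)
        f1s.append(f1)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = sum(row_tot[c] * col_tot[c] for c in range(k)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sum(recalls) / k,
        "specificity": sum(specificities) / k,
        "precision": sum(precisions) / k,
        "f1": sum(f1s) / k,
        "kappa": kappa,
    }


# Hypothetical labels and predictions, not data from the study.
labels = ["deficient", "insufficient", "sufficient"]
y_true = ["deficient"] * 4 + ["insufficient"] * 3 + ["sufficient"] * 3
y_pred = (["deficient"] * 3 + ["insufficient"] * 3 + ["sufficient"] * 4)

metrics = summarize(confusion_matrix(y_true, y_pred, labels))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Kappa is lower than raw accuracy here because part of the agreement between predictions and true labels would be expected by chance given the class frequencies.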
Overall, RF performed better than the SVM (RBF), ENOR, and OLR. These comparison findings will be applied to develop an intelligent vitamin D level detection system that uses routine clinical and biochemical tests and individuals' lifestyle characteristics to reduce the cost and time of vitamin D level detection.