Sharifi Mahyar, Khatibi Toktam, Emamian Mohammad Hassan, Sadat Somayeh, Hashemi Hassan, Fotouhi Akbar
School of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran.
Ophthalmic Epidemiology Research Center, Shahroud University of Medical Sciences, Shahroud, Iran.
BioData Min. 2021 Nov 24;14(1):48. doi: 10.1186/s13040-021-00281-8.
To develop and propose a machine learning model for predicting glaucoma and identifying its risk factors.
The data analysis pipeline for this study was designed according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. Its main steps are data sampling, preprocessing, classification, and evaluation and validation. The training dataset was built by balanced sampling, using over-sampling and under-sampling methods. Preprocessing consisted of missing value imputation and normalization. In the classification step, several machine learning models were built to predict glaucoma: Decision Trees (DTs), K-Nearest Neighbors (K-NN), Support Vector Machines (SVM), Random Forests (RFs), Extra Trees (ETs), and Bagging Ensemble methods. In addition, a novel stacking ensemble model was designed and proposed using the best-performing classifiers.
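The sampling and preprocessing steps described above can be illustrated with a minimal sketch using only the Python standard library. The abstract does not say which imputation or normalization method was used, so mean imputation and min-max scaling are assumptions here, as are the function names and the toy data. Only the under-sampling half of balanced sampling is shown.

```python
import random
from statistics import mean


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values.

    Assumes every column has at least one observed value.
    """
    cols = list(zip(*rows))
    fills = [mean(v for v in col if v is not None) for col in cols]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - min(col) for col in cols]
    return [
        [(v - lows[j]) / spans[j] if spans[j] else 0.0 for j, v in enumerate(row)]
        for row in rows
    ]


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: two features per person, label 1 = glaucoma.
rows = [[52.0, 0.3], [61.0, None], [45.0, 0.7], [58.0, 0.4], [49.0, 0.5], [63.0, 0.8]]
labels = [0, 0, 1, 0, 0, 1]

clean = min_max_normalize(impute_mean(rows))
X_bal, y_bal = undersample(clean, labels)
```

In a real pipeline the imputation means and scaling ranges would normally be computed on the training split only and then applied to the test split, and resampling would be applied only to the training data, consistent with the description above.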
The data came from the Shahroud Eye Cohort Study and include demographic and ophthalmologic data for 5190 participants aged 40-64 living in Shahroud, northeast Iran. The dataset analyzed comprised 67 demographic, ophthalmologic, optometric, perimetry, and biometry features for 4561 people: 4474 non-glaucoma participants and 87 glaucoma patients. Experimental results show that DTs and RFs trained on the under-sampled training dataset outperformed the other single classifiers and the bagging ensemble methods in predicting glaucoma, with average accuracy of 87.61 and 88.87, sensitivity of 73.80 and 72.35, specificity of 87.88 and 89.10, and area under the curve (AUC) of 91.04 and 94.53, respectively. The proposed stacking ensemble achieved an average accuracy of 83.56, a sensitivity of 82.21, a specificity of 81.32, and an AUC of 88.54.
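The four reported metrics can be computed as in the following sketch, which also shows why accuracy alone is a weak guide on data this imbalanced. The class counts (87 glaucoma, 4474 non-glaucoma) come from the abstract; the function names, the "always predict non-glaucoma" baseline, and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
def metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on glaucoma) and specificity (recall on non-glaucoma)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample size, which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts from the abstract; the classifier here is a trivial baseline
# that labels everyone as non-glaucoma.
y_true = [1] * 87 + [0] * 4474
y_pred = [0] * len(y_true)
baseline = metrics(y_true, y_pred)

# Hypothetical scores for four people, two with glaucoma.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The baseline reaches roughly 98% accuracy while detecting no glaucoma cases at all (sensitivity 0), which is why sensitivity, specificity, and AUC are reported alongside accuracy.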
In this study, a machine learning model was developed and proposed to predict glaucoma among persons aged 40-64. The top predictors for discriminating glaucoma patients from non-glaucoma persons were the number of visual field defects on perimetry, vertical cup-to-disc ratio, white-to-white diameter, systolic blood pressure, pupil barycenter on the Y coordinate, age, and axial length.