Sharma Amita, Verbeke Willem J M I
Department of Operations Research & Quantitative Analysis, Institute of Agri-Business Management, Swami Keshwanand Rajasthan Agricultural University, Bikaner, India.
Erasmus University, Rotterdam, Netherlands.
Front Big Data. 2020 Apr 30;3:15. doi: 10.3389/fdata.2020.00015. eCollection 2020.
Machine learning is increasingly used in healthcare, and mental health is receiving growing attention within that field. The diagnosis of mental disorders relies on standardized patient interviews with a defined set of questions and scales, which is a time-consuming and costly process. Our objective was to apply a machine learning model and evaluate whether biomarker data have predictive power that could enhance the diagnosis of depression. In this paper, we explore the detection of depression cases in a dataset of 11,081 Dutch citizens. Most earlier studies used balanced datasets, in which healthy and unhealthy cases occur in equal proportions; in our study, the dataset contains only 570 cases of self-reported depression out of 11,081, which makes this a class-imbalanced classification problem. A machine learning model built on an imbalanced dataset gives predictions biased toward the majority class, so it tends to label a case as non-depressed even when it is a case of depression. We used different resampling strategies to address the class imbalance. We created multiple balanced samples using under-sampling, over-sampling, combined over-under sampling, and ROSE sampling, and then applied the Extreme Gradient Boosting (XGBoost) algorithm to each sample to classify mental illness cases versus healthy cases. The balanced accuracy, precision, recall, and F1 score obtained with over-sampling and over-under sampling were all above 0.90.
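The imbalance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it uses only the class counts stated above (570 positives out of 11,081), and the helper names `random_oversample`, `random_undersample`, and `binary_metrics` are hypothetical, introduced here for illustration. The sketch shows why plain accuracy is misleading for a model that always predicts the majority class, and how simple random over- and under-sampling equalize the class counts. ROSE sampling and the XGBoost classifier itself are not reproduced.

```python
import random
from collections import Counter


def _indices_by_class(labels):
    """Group row indices by their class label."""
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    return groups


def random_oversample(labels, seed=0):
    """Return row indices where smaller classes are padded with random
    duplicates (drawn with replacement) up to the largest class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    target = max(len(ix) for ix in groups.values())
    out = []
    for ix in groups.values():
        out.extend(ix)
        out.extend(rng.choices(ix, k=target - len(ix)))
    rng.shuffle(out)
    return out


def random_undersample(labels, seed=0):
    """Return row indices where larger classes are randomly cut down
    (without replacement) to the smallest class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    target = min(len(ix) for ix in groups.values())
    out = []
    for ix in groups.values():
        out.extend(rng.sample(ix, target))
    rng.shuffle(out)
    return out


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, balanced accuracy, precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Class counts taken from the abstract: 1 = self-reported depression.
labels = [1] * 570 + [0] * (11081 - 570)

# A degenerate model that always predicts "no depression".
baseline = binary_metrics(labels, [0] * len(labels))
print(baseline)

# Class counts after each resampling strategy.
over = random_oversample(labels)
under = random_undersample(labels)
print(Counter(labels[i] for i in over))
print(Counter(labels[i] for i in under))
```

The always-negative baseline reaches an accuracy of about 0.95 while its recall is 0 and its balanced accuracy is 0.5, which is why the abstract reports balanced accuracy, precision, recall, and F1 instead of plain accuracy. In common practice, resampling of this kind is applied to the training split only, so that duplicated minority rows do not also appear in the evaluation data.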