Sri Guru Tegh Bahadur Khalsa College, University of Delhi, Delhi, India.
Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India.
Comput Methods Programs Biomed. 2024 Jan;243:107922. doi: 10.1016/j.cmpb.2023.107922. Epub 2023 Nov 7.
Congenital heart disease (CHD) is one of the most prevalent birth disorders. Although CHD risk factors have been the subject of numerous studies, their propensity to cause CHD has not been tested. In particular, few studies have attempted to forecast CHD risk using population-based cross-sectional data, which are inherently imbalanced.
The main goal of this study is to create a reliable data analysis model that supports (i) a better understanding of CHD prediction in the presence of missing and imbalanced data and (ii) the creation of cohorts of expectant mothers with similar lifestyle characteristics.
Patient cohorts were produced using the unsupervised data mining technique density-based spatial clustering of applications with noise (DBSCAN). For more accurate CHD prediction, a random forest model was trained using these clusters and their corresponding patterns. The study used a dataset of 33,831 expectant mothers. Missing data were handled using k-nearest-neighbour (k-NN) imputation, and the severe class imbalance was corrected using the synthetic minority oversampling technique (SMOTE). These techniques are all data-driven and need little to no user or expert involvement.
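The two preprocessing steps named above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not say which software was used, and the function names (`knn_impute`, `smote`), the toy rows, and the values of `k` are all assumptions made for illustration. In practice, library implementations such as scikit-learn's `KNNImputer` and imbalanced-learn's `SMOTE` would normally be preferred.

```python
import math
import random


def nan_distance(a, b):
    """Euclidean distance over features observed in both rows,
    rescaled to compensate for the features that are missing."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Fill each missing value (None) with the mean of that feature
    over the k nearest rows in which the feature is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row where feature j is present.
            donors = [(nan_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): two features, label 1 is the rare class.
rows = [[1.0, 2.0], [1.2, None], [0.9, 2.2], [5.0, 6.0], [5.1, None]]
labels = [0, 0, 0, 1, 1]

complete = knn_impute(rows, k=2)
minority = [r for r, y in zip(complete, labels) if y == 1]
extra = smote(minority, n_new=1, k=1)
```

Imputation runs first so that SMOTE interpolates between fully observed rows. Oversampling is usually applied to the training split only, so that synthetic points do not leak into the evaluation data; the abstract does not state how the study handled this.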
DBSCAN identified three cohorts. The cluster information improved the random forest-based CHD prediction and revealed intricate factors that influence prediction accuracy. The proposed approach achieved the best results, with 99% accuracy and an AUC of 0.91, and outperformed state-of-the-art methods. Hence, the proposed method, which uses unsupervised learning, can supply the classifier with additional structure in the data and thereby improve classification performance.
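Accuracy and AUC, the two metrics reported above, measure different things: accuracy counts correct thresholded predictions, while AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes both from scratch; the labels, scores, and the 0.5 threshold are toy values chosen for illustration and are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting ties as half a correct ranking."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values).
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because accuracy depends on the chosen threshold and on class proportions while AUC depends only on ranking, the two numbers can differ noticeably on imbalanced data, which is why both are commonly reported together.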