Zhou Pei-Yuan, Lum Faith, Wang Tony Jiecao, Bhatti Anubhav, Parmar Surajsinh, Dan Chen, Wong Andrew K C
Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
AI Engineering Team, SpassMed Inc., Toronto, ON M5H 2S6, Canada.
Bioengineering (Basel). 2024 Jul 31;11(8):770. doi: 10.3390/bioengineering11080770.
Medical datasets may be imbalanced and may contain errors arising from subjective test results and clinical variability. Poor data quality degrades classification accuracy and reliability, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm effectively discovers statistically significant association patterns, generates an interpretable knowledge base, clusters samples in an unsupervised manner, and detects abnormal samples in the dataset. In our experiments, the method outperformed K-Means by 38% on the full dataset and 47% on the reduced dataset for unsupervised clustering. After abnormal samples were removed by the proposed error detection approach, the accuracy of multiple supervised classifiers improved by an average of 4%. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
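The abstract does not say how clustering quality was scored or how PDD flags a sample as abnormal, so the following is only a minimal sketch under stated assumptions. It scores a clustering by majority-vote accuracy (an assumed metric) and flags samples whose recorded label disagrees with the majority label of their cluster (a simple stand-in, not the PDD procedure). The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def _label_counts(cluster_ids, labels):
    """Count how often each label occurs inside each cluster."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    return counts


def cluster_accuracy(cluster_ids, labels):
    """Fraction of samples whose label equals their cluster's majority label.

    Assumed metric for illustration; the paper may use a different one.
    """
    counts = _label_counts(cluster_ids, labels)
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(labels)


def flag_minority_samples(cluster_ids, labels):
    """Indices of samples whose label disagrees with their cluster's majority.

    Hypothetical stand-in for an error detector; this is NOT the PDD method.
    """
    counts = _label_counts(cluster_ids, labels)
    majority = {k: c.most_common(1)[0][0] for k, c in counts.items()}
    return [
        i
        for i, (cluster, label) in enumerate(zip(cluster_ids, labels))
        if label != majority[cluster]
    ]


# Toy data (invented for illustration, not taken from eICU).
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["sepsis", "sepsis", "sepsis", "no_sepsis",
          "no_sepsis", "no_sepsis", "no_sepsis", "sepsis"]

accuracy_before = cluster_accuracy(cluster_ids, labels)
flagged = flag_minority_samples(cluster_ids, labels)

# Drop the flagged samples and score the remaining ones again.
kept = [i for i in range(len(labels)) if i not in set(flagged)]
accuracy_after = cluster_accuracy(
    [cluster_ids[i] for i in kept], [labels[i] for i in kept]
)
```

On this toy input, two of the eight samples are flagged, and the score on the remaining six is perfect by construction. In practice the effect of removing flagged samples would be measured with held-out data and supervised classifiers, as the abstract describes.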