

An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics.

Author information

Zhou Pei-Yuan, Lum Faith, Wang Tony Jiecao, Bhatti Anubhav, Parmar Surajsinh, Dan Chen, Wong Andrew K C

Affiliations

Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada.

AI Engineering Team, SpassMed Inc., Toronto, ON M5H 2S6, Canada.

Publication information

Bioengineering (Basel). 2024 Jul 31;11(8):770. doi: 10.3390/bioengineering11080770.

Abstract

Medical datasets may be imbalanced and contain errors due to subjective test results and clinical variability. Poor data quality reduces classification accuracy and reliability, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm can effectively discover statistically significant association patterns, generate an interpretable knowledge base, cluster samples in an unsupervised manner, and detect abnormal samples in the dataset. In the experiments, our method outperformed K-Means by 38% on the full dataset and 47% on the reduced dataset for unsupervised clustering. After abnormal samples were removed by the proposed error detection approach, the accuracy of multiple supervised classifiers improved by an average of 4%. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
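The general idea of cluster-based mislabel detection can be illustrated with a minimal sketch. This is not the PDD method: the abstract does not specify how PDD flags abnormal samples, so the sketch swaps in a plain k-means clustering (the baseline named in the abstract) and assumes a simple rule in which a sample is suspect when its recorded label disagrees with the majority label of its cluster. The function names, the toy points, and the labels are all hypothetical.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means (a stand-in for PDD, NOT the paper's model).

    Uses deterministic farthest-point initialisation so the toy run is
    reproducible. Returns one cluster index per point.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def flag_suspects(assign, labels):
    """Assumed rule: flag samples whose label differs from their cluster's majority label."""
    majority = {
        c: Counter(l for a, l in zip(assign, labels) if a == c).most_common(1)[0][0]
        for c in set(assign)
    }
    return [i for i, (a, l) in enumerate(zip(assign, labels)) if l != majority[a]]


# Hypothetical toy data: two well-separated groups, with sample 3 mislabelled.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["low", "low", "low", "high", "high", "high", "high", "high"]

assign = kmeans(points, k=2)
suspects = flag_suspects(assign, labels)
print(suspects)  # [3]

# Cleaned dataset that a downstream supervised classifier would be trained on.
cleaned = [(p, l) for i, (p, l) in enumerate(zip(points, labels)) if i not in suspects]
```

In a workflow like the one the abstract describes, a supervised classifier would be trained once on the original data and once on `cleaned`, and the two accuracies compared.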


