Department of Systems Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong.
Graduate School of Public Health, St. Luke's International University, Tokyo, Japan.
J Healthc Eng. 2018 May 22;2018:6275435. doi: 10.1155/2018/6275435. eCollection 2018.
Identifying rare but significant healthcare events in massive unstructured datasets has become a common task in healthcare data analytics. However, the imbalanced class distribution of many practical datasets greatly hampers the detection of rare events, because most classification methods implicitly assume that classes occur equally often and are designed to maximize overall classification accuracy. In this study, we develop a framework for learning from healthcare data with imbalanced class distributions by incorporating different rebalancing strategies. The evaluation results show that the framework can significantly improve the detection of medical incidents caused by look-alike sound-alike (LASA) mix-ups. Specifically, logistic regression combined with the synthetic minority oversampling technique (SMOTE) produces the best detection results, raising recall from 52.1% with plain logistic regression to 75.7%, a relative increase of 45.3%.
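To make the rebalancing idea concrete, the following is a minimal sketch of SMOTE-style oversampling and of the recall metric quoted in the abstract. It is not the authors' implementation: the function names `smote` and `recall`, the parameter defaults, and the toy data are assumptions introduced for illustration only. The core step is standard SMOTE: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (hypothetical helper).

    Each synthetic sample is interpolated between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the positive (rare) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Toy minority class (illustrative data, not from the study)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)

# Relative recall gain from the figures reported in the abstract
relative_gain = (0.757 - 0.521) / 0.521
```

In a full pipeline, the synthetic points would be appended to the training set only, and the classifier would then be evaluated on an untouched test set so that the reported recall reflects the original class distribution.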