A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare.

Authors

Kosolwattana Tanapol, Liu Chenang, Hu Renjie, Han Shizhong, Chen Hua, Lin Ying

Affiliations

Department of Industrial Engineering, University of Houston, Houston, USA.

School of Industrial Engineering & Management, Oklahoma State University, Stillwater, USA.

Publication

BioData Min. 2023 Apr 25;16(1):15. doi: 10.1186/s13040-023-00330-4.

Abstract

In many healthcare applications, classification datasets are highly imbalanced because target events such as disease onset are rare. The Synthetic Minority Over-sampling Technique (SMOTE) is an effective resampling method for imbalanced data classification that oversamples the minority class by generating synthetic samples. However, samples generated by SMOTE may be ambiguous, of low quality, and not separable from the majority class. To improve the quality of the generated samples, we propose a novel self-inspected adaptive SMOTE (SASMOTE) model. It uses an adaptive nearest-neighborhood selection algorithm to identify the "visible" nearest neighbors, which are then used to generate samples that are likely to fall into the minority class. To improve sample quality further, SASMOTE adds an uncertainty-elimination step based on self-inspection, which filters out generated samples that are highly uncertain and inseparable from the majority class. The proposed algorithm is compared with existing SMOTE-based algorithms and demonstrated on two real-world healthcare case studies: risk gene discovery and fatal congenital heart disease prediction. By generating higher-quality synthetic samples, the proposed algorithm achieves better prediction performance on average (in terms of F1 score) than the other methods, which is promising for improving the usability of machine learning models on highly imbalanced healthcare data.
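To make the resampling idea in the abstract concrete, the following is a minimal sketch of plain SMOTE interpolation followed by a simple distance-based filter. It is not the SASMOTE implementation: the abstract does not specify how "visible" neighbors are selected or how self-inspection measures uncertainty, so the filter here is only an illustrative stand-in. The function names `smote_sketch` and `filter_near_majority`, and the parameters `n_new`, `k`, and `seed`, are assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Plain SMOTE: interpolate between a minority sample and one of
    its k nearest minority-class neighbors (requires >= 2 samples)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    # Pick a random base point and one of its neighbors per new sample.
    base = rng.integers(0, n, size=n_new)
    pick = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

def filter_near_majority(X_syn, X_min, X_maj):
    """Illustrative stand-in for a quality filter (NOT the paper's
    self-inspection step): keep synthetic points whose nearest real
    minority sample is closer than their nearest real majority sample."""
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    d_min = np.linalg.norm(X_syn[:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    d_maj = np.linalg.norm(X_syn[:, None, :] - X_maj[None, :, :], axis=2).min(axis=1)
    return X_syn[d_min < d_maj]

# Toy usage: a small minority cluster and a distant majority cluster.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_maj = np.array([[10.0, 10.0], [11.0, 10.0]])
synthetic = smote_sketch(X_min, n_new=20, k=2)
kept = filter_near_majority(synthetic, X_min, X_maj)
```

Because each synthetic point is a convex combination of two minority samples, it lies on the segment between them. When the classes overlap, such segments can cross majority regions, which is the kind of ambiguous sample the abstract says SASMOTE aims to avoid or filter out.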


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36e7/10131309/f7e8bf2d4430/13040_2023_330_Fig1_HTML.jpg
