SMOTE-CD: SMOTE for compositional data.

Affiliations

Laboratoire de Mathématiques et de leurs Applications, Université de Pau et des Pays de l'Adour, E2S UPPA, CNRS, Anglet, France.

School of Mathematics and Physical Sciences, Macquarie University, Sydney, NSW, Australia.

Publication information

PLoS One. 2023 Jun 29;18(6):e0287705. doi: 10.1371/journal.pone.0287705. eCollection 2023.

Abstract

Compositional data are a special kind of data, represented as proportions that carry relative information. Although this type of data is widespread, no solution exists for cases where the classes are not well balanced. After describing compositional data imbalance, this paper proposes an adaptation of the original Synthetic Minority Oversampling TEchnique (SMOTE) to deal with it. The new approach, called SMOTE for Compositional Data (SMOTE-CD), generates synthetic examples by computing a linear combination of selected existing data points, using compositional data operations. SMOTE-CD is tested with three different regressors (gradient boosting trees, neural networks, and a Dirichlet regressor) applied to two real datasets and to synthetically generated data, and performance is evaluated using accuracy, cross-entropy, F1-score, R² score and RMSE. The results show improvements across all metrics, but the impact of oversampling on performance varies depending on the model and the data. In some cases, oversampling may decrease performance for the majority class. However, for the real data, the best performance across all models is achieved when oversampling is used. Notably, oversampling consistently increases the F1-score. Unlike with the original technique, performance is not improved by combining oversampling of the minority classes with undersampling of the majority class. The Python package smote-cd implements the method and is available online.
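To make "a linear combination using compositional data operations" concrete, here is a minimal sketch. It assumes the operations meant are the standard Aitchison perturbation and powering on the simplex; the abstract does not spell this out, and the helper names (`closure`, `perturb`, `power`, `interpolate`) are hypothetical, not the API of the smote-cd package. It also assumes all parts are strictly positive.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Aitchison perturbation: the simplex analogue of vector addition."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(x, a):
    """Aitchison powering: the simplex analogue of scalar multiplication."""
    return closure(np.asarray(x, dtype=float) ** a)

def interpolate(x, y, lam):
    """Compositional analogue of x + lam * (y - x) for lam in [0, 1].

    Assumption: this mirrors the SMOTE interpolation step with ordinary
    addition and scaling swapped for perturbation and powering.
    """
    diff = perturb(y, 1.0 / np.asarray(x, dtype=float))  # plays the role of y - x
    return perturb(x, power(diff, lam))

# Illustrative compositions (made up for this sketch, not from the paper).
x = np.array([0.7, 0.2, 0.1])
y = np.array([0.1, 0.3, 0.6])
synthetic = interpolate(x, y, 0.5)
```

Under these assumptions the synthetic point stays on the simplex (positive parts summing to 1) for any `lam`, which a plain arithmetic combination followed by ad hoc renormalisation would not respect in the same geometry.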

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/28f2/10309641/0ab21d066d14/pone.0287705.g001.jpg
