Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis.

Authors

Fujiwara Koichi, Huang Yukun, Hori Kentaro, Nishioji Kenichi, Kobayashi Masao, Kamaguchi Mai, Kano Manabu

Affiliations

Department of Material Process Engineering, Nagoya University, Nagoya, Japan.

Department of Systems Science, Kyoto University, Kyoto, Japan.

Publication information

Front Public Health. 2020 May 19;8:178. doi: 10.3389/fpubh.2020.00178. eCollection 2020.

DOI:10.3389/fpubh.2020.00178

PMID:32509717

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7248318/

Abstract

A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, it is not always easy to analyze HR data, particularly when the number of persons with a target disease is very small relative to the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, and they can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, which is called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm called boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost) to solve the EISM data problem. To make an artificially balanced dataset from the original imbalanced dataset, HUSDOS-Boost uses both under-sampling and over-sampling: it eliminates redundant majority examples based on prior boosting results and generates artificial minority examples by following the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. The results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful method for HR data analysis.
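
As a rough illustration of the balancing step described in the abstract, the Python sketch below builds an artificially balanced dataset by combining weighted under-sampling of the majority class with distribution-based over-sampling of the minority class. It is not the authors' implementation. The function names, the independent per-feature Gaussian model of the minority class, and the externally supplied `weights` (standing in for whatever the prior boosting rounds provide) are all assumptions made for illustration.

```python
import random
import statistics


def oversample_minority(minority, n_new, rng):
    """Draw n_new synthetic rows from independent per-feature Gaussians
    fitted to the minority rows (a simplifying assumption)."""
    columns = list(zip(*minority))
    means = [statistics.fmean(col) for col in columns]
    # pstdev also works for a single row (it returns 0.0), unlike stdev.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(m, s) for m, s in zip(means, stds)]
        for _ in range(n_new)
    ]


def undersample_majority(majority, weights, n_keep, rng):
    """Keep n_keep majority rows, drawn without replacement with
    probability proportional to the given positive weights."""
    pool = list(range(len(majority)))
    pool_weights = list(weights)
    kept = []
    for _ in range(min(n_keep, len(pool))):
        pick = rng.choices(range(len(pool)), weights=pool_weights, k=1)[0]
        kept.append(majority[pool.pop(pick)])
        pool_weights.pop(pick)
    return kept


def make_balanced(minority, majority, weights, n_per_class, seed=0):
    """Return (X, y) with n_per_class rows per class; label 1 = minority."""
    rng = random.Random(seed)
    n_new = max(0, n_per_class - len(minority))
    synthetic = oversample_minority(minority, n_new, rng)
    kept = undersample_majority(majority, weights, n_per_class, rng)
    X = list(minority) + synthetic + kept
    y = [1] * (len(minority) + len(synthetic)) + [0] * len(kept)
    return X, y


# Hypothetical toy data: 3 minority rows against 200 majority rows.
minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
majority = [[float(i), float(i % 7)] for i in range(200)]
weights = [1.0] * len(majority)  # uniform placeholder weights
X, y = make_balanced(minority, majority, weights, n_per_class=20)
```

In a boosting setting, a balanced set like this would be rebuilt in each round before fitting the next weak learner, with `weights` refreshed from the previous rounds. How the weights map to "redundant" majority examples is not specified in the abstract, so the proportional sampling used here is only one plausible choice.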

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b839/7248318/a26c02e035a0/fpubh-08-00178-g0001.jpg
