Tasci Erdal, Zhuge Ying, Camphausen Kevin, Krauze Andra V
Center for Cancer Research, National Cancer Institute, NIH, Building 10, Bethesda, MD 20892, USA.
Department of Computer Engineering, Ege University, Izmir 35100, Turkey.
Cancers (Basel). 2022 Jun 12;14(12):2897. doi: 10.3390/cancers14122897.
Recent technological developments have increased both the size and the variety of data in the medical field, which now come from multiple platforms, including proteomic, genomic, imaging, and clinical sources. Many machine learning models have been developed on large-scale medical data to support precision/personalized medicine initiatives such as computer-aided detection, diagnosis, prognosis, and treatment planning. Bias and class imbalance are two of the most pressing challenges for machine learning, particularly in medical (e.g., oncologic) data sets, owing to limited patient numbers, cost, the privacy and security of data sharing, and the complexity of the generated data. Methods for addressing class imbalance, when chosen to suit the data set and the research question, can yield more effective and meaningful results. This review discusses the essential strategies for addressing and mitigating class imbalance across different medical data types in the oncologic domain.
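The abstract does not name specific techniques, so the following is only an illustrative sketch of two commonly used ways to handle class imbalance: random oversampling of the minority class and inverse-frequency class weights. The function names, the toy labels ("benign", "malignant"), and the 8:2 split are assumptions made for illustration and are not taken from the review.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority count.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy data (assumed): 8 majority-class and 2 minority-class cases.
labels = ["benign"] * 8 + ["malignant"] * 2
samples = list(range(10))

resampled_samples, resampled_labels = random_oversample(samples, labels)
weights = balanced_class_weights(labels)

print(Counter(resampled_labels))
print(weights)
```

Oversampling changes the training data itself, whereas class weights leave the data untouched and adjust how much each class counts in the loss. Which approach suits a given oncologic data set depends on the data type and the research question. If oversampling is used, it should be applied to the training split only, so that duplicated cases do not leak into the evaluation data.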