Selvaraj Sharanya, Alsayed Alhuseen Omar, Ismail Nor Azman, Kavin Balasubramanian Prabhu, Onyema Edeh Michael, Seng Gan Hong, Uchechi Arinze Queen
Department of Data Science and Business Systems, SRM Institute of Science and Technology, Kattankulathur, Chennai, India, 603203.
Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia.
Discov Oncol. 2024 Sep 27;15(1):499. doi: 10.1007/s12672-024-01337-x.
Leukemia is a cancer of the bone marrow and lymphatic system that requires complex treatment strategies, which vary by subtype. Because the morphological differences among subtypes are subtle, monitoring gene expression is crucial for accurate classification. Manual or pathological testing can be time-consuming and expensive, so data-driven methods and machine learning algorithms offer an efficient alternative for leukemia classification. This study introduces a novel super learning model that leverages heterogeneous machine learning models to analyze gene expression data and classify leukemia cells. The proposed approach incorporates an entropy-based feature importance technique to identify the gene profiles most significant to the labeling process. The strength of the model lies in its final super learner, a Random Forest, which classifies the cross-validated outputs of the candidate learners. Validation on a gene expression monitoring dataset demonstrates that the model outperforms other state-of-the-art models in predictive accuracy. The study adds to the evidence that advanced machine learning techniques can improve the accuracy and reliability of leukemia classification from gene expression data, addressing the limitations of traditional methods that rely on clinical features and morphological examination.
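The abstract does not say how the entropy-based feature importance is computed. As a rough illustration only, one common entropy-based score is information gain: the drop in Shannon entropy of the class labels after splitting the samples at a threshold on a single gene. The sketch below assumes that interpretation. The function names, the gene names, the class labels, and the toy expression values are all hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold` on one gene."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    if not low or not high:
        return 0.0
    weighted = (len(low) * entropy(low) + len(high) * entropy(high)) / len(labels)
    return entropy(labels) - weighted


def best_gain(values, labels):
    """Largest gain over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, t) for t in thresholds), default=0.0)


def rank_genes(expression, labels):
    """Rank genes by best information gain, highest first.

    `expression` maps a gene name to its expression values, one per sample.
    """
    scores = {gene: best_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: six samples, two classes, two genes.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # separates the two classes
    "gene_b": [2.0, 0.5, 3.0, 2.1, 0.4, 2.9],  # carries little class signal
}
ranking = rank_genes(expression, labels)
```

In this toy case `gene_a` ranks first because a single threshold separates the two classes completely. In a pipeline like the one the abstract describes, the top-ranked genes would presumably be passed on to the candidate learners, though the abstract gives no details on that step.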