A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques.

Author information

Athisayamani Suganya, S Tamilazhagan, Singh A Robert, Hwang Jae-Yong, Joshi Gyanendra Prasad

Affiliations

Department of Computing Technologies, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, India.

School of Computing, Sastra Deemed to be University, Thanjavur, Tamil Nadu, India.

Publication information

Sci Rep. 2025 Jul 2;15(1):22971. doi: 10.1038/s41598-025-06426-7.

DOI:10.1038/s41598-025-06426-7

PMID:40596255

Abstract

In this paper, three Double Machine Learning (DML) models are proposed to improve the accuracy of breast cancer detection on a breast cancer detection dataset. Each DML model learns the primary features with a machine learning model and a deep learning model, and a meta-classifier then fuses these features to achieve the best classification performance. The first DML model combines the interpretability of Random Forest (RF) with the deep learning capabilities of a Feedforward Neural Network (FNN). RF processes structured features and provides class probabilities and feature importance scores, while the FNN learns non-linear relationships and generates embeddings. These outputs are fused into a combined feature vector, which a meta-classifier uses for the final prediction. This approach captures both structured features and non-linear patterns, making it suitable for datasets with complex dependencies. The second model pairs eXtreme Gradient Boosting (XGBoost), a highly efficient boosting algorithm for tabular data, with an Artificial Neural Network (ANN). XGBoost optimizes decision tree ensembles and provides class probabilities, while the ANN processes numerical data to learn deeper representations. A meta-classifier then uses the fused outputs of XGBoost and the ANN for the final prediction. This model is particularly effective for datasets that combine structured features (handled by XGBoost) with numerical features (handled by the ANN). The third model integrates LightGBM, a fast and scalable gradient-boosting framework, with an ANN, which is well suited to analyzing sequential data. LightGBM processes structured features to provide probabilities and importance scores, while the ANN learns temporal dependencies from sequential data. The outputs of LightGBM and the ANN are concatenated and passed to a meta-classifier for decision-making. This model is ideal for datasets with both static features (LightGBM) and continuous data (ANN), such as time-series datasets or datasets with sequential dependencies. When combined with dimensionality reduction (PCA) and feature selection, these DML models significantly improve the performance of breast cancer detection systems by leveraging both structured and sequential data, reaching an accuracy of 0.99.
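
The fusion scheme described for the first model (tree-ensemble class probabilities concatenated with a neural-network embedding, then passed to a meta-classifier) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the Wisconsin diagnostic dataset bundled with scikit-learn stands in for the unnamed dataset, MLPClassifier stands in for the FNN, logistic regression is used as the meta-classifier, and the hold-out split, k=20 selected features, 10 principal components and 16 hidden units are arbitrary. The helper names hidden_embedding, fuse and run_sketch are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def hidden_embedding(mlp, X):
    """First-hidden-layer ReLU activations of a fitted MLPClassifier."""
    return np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])


def fuse(rf, mlp, X):
    """Concatenate RF class probabilities with the network embedding."""
    return np.hstack([rf.predict_proba(X), hidden_embedding(mlp, X)])


def run_sketch(seed=0):
    # Assumption: scikit-learn's bundled Wisconsin diagnostic data as a stand-in.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Hold out part of the training data so the meta-classifier is fitted
    # on base-model outputs for samples the base models have not seen.
    X_base, X_meta, y_base, y_meta = train_test_split(
        X_train, y_train, test_size=0.3, stratify=y_train, random_state=seed)

    # Feature selection followed by PCA (illustrative sizes, not from the paper).
    prep = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=20),
                         PCA(n_components=10, random_state=seed))
    Z_base = prep.fit_transform(X_base, y_base)
    Z_meta = prep.transform(X_meta)
    Z_test = prep.transform(X_test)

    # Base learners: a tree ensemble and a small feedforward network.
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(Z_base, y_base)
    mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                        max_iter=2000, random_state=seed)
    mlp.fit(Z_base, y_base)

    # Meta-classifier trained on the fused feature vector.
    meta = LogisticRegression(max_iter=1000)
    meta.fit(fuse(rf, mlp, Z_meta), y_meta)

    fused_test = fuse(rf, mlp, Z_test)
    return fused_test.shape, meta.score(fused_test, y_test)


if __name__ == "__main__":
    shape, accuracy = run_sketch()
    print("fused test matrix shape:", shape)
    print("meta-classifier test accuracy:", round(accuracy, 3))
```

Under the same assumptions, the second and third models would follow the same pattern with the random forest replaced by an XGBoost or LightGBM classifier. The accuracy printed by this sketch depends on the stand-in dataset and settings and should not be read as a reproduction of the reported 0.99.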
