Faculty of Medicine, Department of Medical Informatics, Mashhad University of Medical Sciences, Mashhad, Iran.
Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran.
J Cancer Res Clin Oncol. 2023 Dec;149(19):17133-17146. doi: 10.1007/s00432-023-05388-5. Epub 2023 Sep 29.
Breast cancer (BC) is a multifactorial disease and one of the most common cancers globally. This study aimed to compare different machine learning (ML) techniques in order to develop a comprehensive breast cancer risk prediction model based on features drawn from several categories of risk factors.
The population sample contained 810 records (115 cancer patients and 695 healthy individuals). Based on expert opinion, 45 of 85 attributes were selected. The selected attributes cover genetic, biochemical, biomarker, gender, demographic, and pathological factors. Thirteen machine learning models were trained on the selected attributes, and the attribute coefficients and internal relationships were calculated.
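The class counts above imply a strongly imbalanced sample, which matters when reading any accuracy figure. The following is a minimal arithmetic sketch using only the two counts reported in the abstract. The variable names are illustrative, and the "majority-class baseline" is a standard reference point added here, not something the study reports.

```python
# Counts reported in the abstract.
n_cancer = 115
n_healthy = 695
n_total = n_cancer + n_healthy  # 810 records

# Share of records in the cancer class.
prevalence = n_cancer / n_total

# Accuracy of a trivial classifier that labels every record "healthy".
# This is a reference point only; it is not a result from the study.
majority_baseline_accuracy = n_healthy / n_total

print(f"cancer-class prevalence: {prevalence:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because a classifier that never predicts cancer would already be right on most records, accuracy on data like this is usually read together with class-sensitive measures such as precision and AUC.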
Random forest (RF) outperformed the other methods (accuracy 99.26%, precision 99%, and area under the curve (AUC) 99%). Assessing the impact and correlation of variables with the RF method based on PCA indicated that pathology, biomarker, biochemistry, gene, and demographic factors, with coefficients of 0.35, 0.23, 0.15, 0.14, and 0.13 respectively, affected the risk of BC (r = 0.54).
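For readers less familiar with the reported measures, the sketch below shows how accuracy and precision are defined from confusion-matrix counts, and then ranks the five factor-group coefficients quoted above. The confusion counts are toy values chosen for illustration only and are not taken from the study. The function and variable names are likewise assumptions. The abstract does not state how the coefficients were scaled, so the sum is shown as an observation rather than an explanation.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Fraction of all records classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)

# Toy confusion counts (hypothetical, NOT from the study).
tp, fp, fn, tn = 40, 10, 5, 45
print(f"toy accuracy:  {accuracy(tp, fp, fn, tn):.2f}")
print(f"toy precision: {precision(tp, fp):.2f}")

# Factor-group coefficients exactly as reported in the abstract.
coefficients = {
    "pathology": 0.35,
    "biomarker": 0.23,
    "biochemistry": 0.15,
    "gene": 0.14,
    "demographic": 0.13,
}

# Rank the groups from largest to smallest coefficient.
ranked = sorted(coefficients, key=coefficients.get, reverse=True)
total = sum(coefficients.values())

print("ranking:", ", ".join(ranked))
print(f"sum of coefficients: {total:.2f}")
```

The five reported coefficients add up to 1.00, which is consistent with them being relative weights across the factor groups, although the abstract does not say so explicitly.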
Breast cancer has several risk factors, and medical experts use them for early diagnosis. Identifying the relevant risk factors and their effects can therefore increase the accuracy of diagnosis. Considering a broad range of features when predicting breast cancer supports the development of a comprehensive prediction model. In this study, an RF-based breast cancer prediction model with 99.3% accuracy was developed from multifactorial features.