Data Scientist at AFRY AB, Sweden.
Department of Medical Informatics, Faculty of Health, Mashhad University of Medical Sciences, Mashhad, Iran.
J Prev Med Hyg. 2024 Aug 31;65(2):E221-E226. doi: 10.15167/2421-4248/jpmh2024.65.2.3045. eCollection 2024 Jun.
Low survival rates of breast cancer in developing countries are mainly due to the lack of early detection plans and adequate diagnosis and treatment facilities.
This study aimed to apply machine learning techniques to recognize the most important breast cancer risk factors.
This case-control study included women aged 17-75 years who were referred to medical centers affiliated with Mashhad University of Medical Sciences between March 21, 2015, and March 19, 2016. The study used two datasets: one with 516 samples (258 cases and 258 controls) and another with 606 samples (303 cases and 303 controls). Written informed consent was obtained from all participants. Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Principal Component Analysis (PCA) were applied using RStudio.
For DT and RF, the most important features associated with breast cancer were family history of cancer, individual history of breast cancer, biopsy sampling, and rare consumption of dairy, fruit, and vegetable meals, whereas for PCA and LR they were family history of cancer, number of pregnancies, pregnancy tendency, abortion, first menstruation, age at first childbirth, and number of childbirths.
Machine learning algorithms can be used to extract the most important factors in the diagnosis of breast cancer in developing countries such as Iran.
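As a rough illustration of how tree-based methods such as DT and RF can rank risk factors, the sketch below scores binary features by the decrease in Gini impurity that a single split on each feature would achieve. It is a minimal Python sketch under stated assumptions, not the study's R analysis: the data are synthetic, the feature names `family_cancer` and `unrelated_noise` and the exposure probabilities are hypothetical, and the abstract does not state which importance measure the authors used.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_gain(feature, labels):
    """Impurity decrease obtained by splitting on one binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def rank_features(columns, labels):
    """Return (feature name, Gini gain) pairs, most informative first."""
    gains = {name: gini_gain(col, labels) for name, col in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


def make_synthetic(n_per_group=258, seed=0):
    """Build a balanced synthetic case-control set (NOT the study data).

    The exposure probabilities below are invented for illustration only.
    """
    rng = random.Random(seed)
    labels = [1] * n_per_group + [0] * n_per_group  # 1 = case, 0 = control
    columns = {
        # Hypothetical: exposure is more common among cases than controls.
        "family_cancer": [int(rng.random() < (0.6 if y else 0.2)) for y in labels],
        # Hypothetical: exposure is unrelated to case status.
        "unrelated_noise": [int(rng.random() < 0.5) for _ in labels],
    }
    return columns, labels


if __name__ == "__main__":
    columns, labels = make_synthetic()
    for name, gain in rank_features(columns, labels):
        print(f"{name}: {gain:.4f}")
```

A random forest averages impurity decreases of this kind over many trees grown on resampled data, which is one reason its ranking can differ from rankings based on LR coefficients or PCA loadings.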