Yu Shiqun, Wang Chengman, Ouyang Jin, Luo Ting, Zeng Fanfan, Zhang Yu, Gao Liyun, Huang Shaoxin, Wang Xin
Yunfu Center for Disease Control and Prevention, Yunfu, China.
Jiangxi Provincial Key Laboratory of Cell Precision Therapy, School of Basic Medical Sciences, Jiujiang University, Jiujiang, 332005, Jiangxi, China.
Sci Rep. 2025 Mar 13;15(1):8770. doi: 10.1038/s41598-025-93208-w.
Breast cancer (BC) is the second leading cause of cancer-related death in females, after lung cancer. Conventional diagnostic techniques for BC have drawbacks, such as radiation risk. The present study integrated bioinformatics analysis with machine learning to identify potential key candidate genes associated with BC tumorigenesis. Eleven datasets were downloaded from the Gene Expression Omnibus (GEO) database and consolidated into two independent cohorts (a training cohort and a validation cohort) after batch-effect removal. We used the "limma" package to screen for differentially expressed genes (DEGs) between BC and adjacent normal breast samples. Subsequently, the most reliable diagnostic indicators were identified using LASSO-logistic regression, SVM-RFE, and multivariate stepwise logistic regression analysis. A logistic model and a nomogram were built on these hub genes and applied to the external validation cohort to verify the robustness of the model. In total, six hub genes linked to BC pathogenesis were identified: CD300LG, IGSF10, FAM83D, MAMDC2, COMP, and SEMA3G. A diagnostic model of BC based on these genes was then established. ROC analysis of the diagnostic model showed that the AUC in the training cohort was 0.978 (0.962, 0.995). In the validation cohort, the AUCs of the training set and validation set were 0.936 (0.910, 0.961) and 0.921 (0.870, 0.972), respectively. This indicated that the model was reliable in separating BC patients from healthy individuals. The model may assist in early diagnosis of BC, with implications for improving the prognosis of BC patients.
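To make the final modelling step more concrete, the following is a minimal sketch of how a six-gene logistic diagnostic score and its AUC with a 95% interval could be computed. It is not the authors' code. The coefficients, the intercept, the simulated expression values, and the helper names (`predict_probability`, `auc`, `bootstrap_ci`, `simulate_sample`) are all hypothetical, and the percentile bootstrap interval is an assumption standing in for whatever interval method the study actually used. Only the six gene names come from the abstract.

```python
import math
import random

# The six hub genes named in the abstract.
GENES = ["CD300LG", "IGSF10", "FAM83D", "MAMDC2", "COMP", "SEMA3G"]

# HYPOTHETICAL intercept and coefficients, chosen arbitrarily for illustration.
# They are not taken from the paper and carry no biological meaning.
INTERCEPT = 0.0
COEFS = {"CD300LG": -1.0, "IGSF10": -0.8, "FAM83D": 1.2,
         "MAMDC2": -0.7, "COMP": 0.9, "SEMA3G": -0.6}


def predict_probability(expr):
    """Logistic model: map one sample's gene expression values to P(BC)."""
    z = INTERCEPT + sum(COEFS[g] * expr[g] for g in GENES)
    return 1.0 / (1.0 + math.exp(-z))


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


def simulate_sample(rng, is_tumor):
    """Draw SYNTHETIC expression values; tumors are shifted along each gene."""
    shift = 1.0 if is_tumor else 0.0
    return {g: rng.gauss(shift * (1.0 if COEFS[g] > 0 else -1.0), 1.0)
            for g in GENES}


# Synthetic cohort: 100 tumor samples (label 1) and 100 normal samples (label 0).
rng = random.Random(42)
labels = [1] * 100 + [0] * 100
samples = [simulate_sample(rng, y == 1) for y in labels]
scores = [predict_probability(s) for s in samples]

model_auc = auc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
print(f"AUC = {model_auc:.3f} ({ci_low:.3f}, {ci_high:.3f})")
```

The printed line follows the same "AUC (lower, upper)" layout used for the reported values, but the numbers it produces describe only the simulated data. In a real analysis the coefficients would be estimated on the training cohort and then applied unchanged to the external validation cohort.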