Wu Yue, Min Kai-Yuan, Liu Jiang-Feng, Liang Wan-Feng, Yang Ye-Hong, Hu Gang, Yang Jun-Tao
1 School of Statistics and Data Science, Nankai University, Tianjin 300071, China.
2 State Key Laboratory of Common Mechanism Research for Major Diseases, Institute of Basic Medical Sciences, CAMS and PUMC, Beijing 100005, China.
Zhongguo Yi Xue Ke Xue Yuan Xue Bao. 2024 Apr;46(2):147-153. doi: 10.3881/j.issn.1000-503X.15717.
Objective To identify biomarkers linked to the prognosis of breast invasive carcinoma by analyzing transcriptome data with random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). Methods We obtained expression data for breast invasive carcinoma from The Cancer Genome Atlas and used DESeq2, the t-test, and univariate Cox analysis to identify differentially expressed protein-coding genes associated with survival prognosis in human breast invasive carcinoma samples. RF, XGBoost, LightGBM, and CatBoost models were then built to mine protein-coding gene markers related to the prognosis of breast invasive carcinoma, and the performance of the models was compared. Breast cancer expression data from the Gene Expression Omnibus were used for validation. Results A total of 151 differentially expressed protein-coding genes related to survival prognosis were identified. The machine learning model built with C3orf80, UGP2, and SPC25 showed the best performance. Conclusions Three protein-coding genes (UGP2, C3orf80, and SPC25) were identified as markers for recognizing breast invasive carcinoma. This study provides a new direction for the treatment and diagnosis of breast invasive carcinoma.
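The abstract names a t-test as one of the gene-screening steps. As a rough illustration only, the sketch below ranks genes by Welch's t statistic between two sample groups. It is not the authors' pipeline: the data are synthetic, the gene names (`GENE_A`, `GENE_B`, `GENE_C`), the helper functions `welch_t` and `screen_genes`, and the cutoff of 4.0 are all hypothetical, and the DESeq2, Cox, and model-fitting steps are omitted.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def screen_genes(tumor, normal, threshold):
    """Return genes whose |t| exceeds the threshold, largest |t| first."""
    scores = {g: welch_t(tumor[g], normal[g]) for g in tumor}
    hits = [g for g in scores if abs(scores[g]) > threshold]
    return sorted(hits, key=lambda g: abs(scores[g]), reverse=True)


# Synthetic toy data (NOT from the study): GENE_A is shifted upward in the
# tumor group; GENE_B and GENE_C share one distribution across both groups.
rng = random.Random(0)
shift = {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 0.0}
tumor = {g: [rng.gauss(5.0 + s, 1.0) for _ in range(30)] for g, s in shift.items()}
normal = {g: [rng.gauss(5.0, 1.0) for _ in range(30)] for g in shift}

# The cutoff is an arbitrary illustrative value, not one reported in the paper.
selected = screen_genes(tumor, normal, threshold=4.0)
print(selected)
```

In practice a p-value with multiple-testing correction would replace the fixed cutoff on |t|; the fixed cutoff is used here only to keep the sketch dependency-free.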