Department of Computer Engineering, Faculty of Engineering, Marmara University, Istanbul, Turkey.
Department of Bioengineering, Faculty of Engineering, Marmara University, Istanbul, Turkey.
OMICS. 2022 Sep;26(9):504-511. doi: 10.1089/omi.2022.0089. Epub 2022 Aug 30.
The rise of machine learning (ML) has recently bolstered efforts toward big data-driven precision oncology. This study used ensemble ML for precision oncology in breast cancer, which is one of the most common malignancies worldwide and shows marked heterogeneity in its underlying molecular mechanisms. We analyzed clinical and RNA-seq data from The Cancer Genome Atlas (TCGA) (844 patients with breast cancer and 113 healthy individuals) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (1784 patients with breast cancer and 202 healthy individuals). We evaluated six algorithms in the context of ensemble modeling and identified a candidate mRNA diagnostic panel that can differentiate patients from healthy controls and stratify breast cancer into molecular subtypes. The ensemble model included 50 mRNAs and displayed 82.55% accuracy, 79.22% specificity, and 84.55% sensitivity in stratifying patients into molecular subtypes in the TCGA cohort. Its performance was markedly higher, however, in distinguishing the basal, LumB, and Her2+ breast cancer subtypes from healthy individuals. In overall survival analysis, the mRNA panel showed a hazard ratio of 2.25 ( = 5 × 10) for breast cancer and was significantly associated with molecular pathways related to carcinogenesis. In conclusion, an ensemble ML approach based on 50 mRNAs was able to stratify patients with different breast cancer subtypes and differentiate them from healthy individuals. Future prospective studies in large samples with deep phenotyping can help advance ensemble ML approaches in breast cancer. Advanced ML methods such as ensemble learning are timely additions to the precision oncology research toolbox.
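As a rough illustration of the reported metrics, the sketch below combines the label predictions of six classifiers by majority vote and then computes accuracy, sensitivity, and specificity for the simpler patient-versus-healthy case. It rests on several assumptions: the abstract does not state how the six algorithms were combined, so majority voting is used here only as one common ensemble rule; the function names `majority_vote` and `binary_metrics` are hypothetical; and the toy labels are invented for the example and are not study data.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by plurality vote.

    Assumption: simple voting stands in for the study's ensemble rule,
    which the abstract does not describe.
    """
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        # On a tie, most_common returns the label encountered first,
        # so earlier models in the list break ties.
        combined.append(votes.most_common(1)[0][0])
    return combined


def binary_metrics(y_true, y_pred, positive="tumor"):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented toy labels: four tumor samples followed by four healthy samples.
T, N = "tumor", "normal"
y_true = [T, T, T, T, N, N, N, N]

# Invented predictions from six hypothetical classifiers (one list per model).
model_predictions = [
    [T, T, T, N, N, N, N, T],
    [T, T, N, T, N, N, T, N],
    [T, T, T, T, N, N, N, N],
    [T, N, T, N, N, T, N, N],
    [T, T, T, N, N, N, N, T],
    [T, T, N, N, T, N, N, N],
]

ensemble = majority_vote(model_predictions)
metrics = binary_metrics(y_true, ensemble)
print(ensemble)
print(metrics)
```

The subtype stratification figures quoted in the abstract come from a multi-class problem, where sensitivity and specificity would be computed per subtype (or averaged across subtypes) rather than from a single two-class table as in this sketch.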