Department of Genome Biology, Faculty of Medicine, Kindai University, Ohnohigashi 377-2, Osaka-Sayama, 589-9511, Japan.
Department of Medical Oncology, Faculty of Medicine, Kindai University, Osaka-Sayama, Japan.
Int J Clin Oncol. 2024 Dec;29(12):1795-1810. doi: 10.1007/s10147-024-02617-w. Epub 2024 Sep 18.
Genome-wide DNA methylation profiling is a promising but costly method for cancer classification because it generates large volumes of data. We developed an ensemble learning model to identify cancer types using methylation profiles from a limited number of CpG sites.
Analyzing methylation data from 890 samples across 10 cancer types from the TCGA database, we utilized ANOVA and Gain Ratio to select the most significant CpG sites, then employed Gradient Boosting to reduce these to just 100 sites.
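The two filter statistics named above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The site names, the toy beta values, the cancer labels, and the 0.5 binarization threshold used for Gain Ratio are all assumptions made for illustration, and the subsequent Gradient Boosting reduction to 100 sites is not shown.

```python
import math
from collections import Counter, defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic for one CpG site across cancer-type labels."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group and within-group sums of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def entropy(items):
    """Shannon entropy (bits) of a sequence of discrete items."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels, threshold=0.5):
    """Information gain divided by split information.

    Assumption: beta values are binarized at `threshold`; the source does
    not say how continuous methylation values were discretized.
    """
    bins = [v >= threshold for v in values]
    split_info = entropy(bins)
    if split_info == 0:
        return 0.0  # the site never splits the samples
    n = len(labels)
    conditional = 0.0
    for b in set(bins):
        subset = [y for y, flag in zip(labels, bins) if flag == b]
        conditional += len(subset) / n * entropy(subset)
    return (entropy(labels) - conditional) / split_info


# Toy data: hypothetical site names and beta values, six samples, two classes.
labels = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD"]
sites = {
    "cg_informative": [0.90, 0.85, 0.80, 0.10, 0.15, 0.20],
    "cg_noise": [0.50, 0.10, 0.90, 0.50, 0.10, 0.90],
}

# Rank sites by ANOVA F, highest first; a real run would keep the top-k.
ranked = sorted(sites, key=lambda s: anova_f(sites[s], labels), reverse=True)

for name in ranked:
    f_stat = anova_f(sites[name], labels)
    gr = gain_ratio(sites[name], labels)
    print(f"{name}: F={f_stat:.1f}, gain_ratio={gr:.2f}")
```

In this toy case the site whose values separate the two classes ranks first on both statistics, while the site with identical class means scores zero. A filter of this kind only shortlists candidate sites; the final subset in the study came from a model-based selector.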
This approach maintained high accuracy across multiple machine learning models, with classification accuracy between 87.7% and 93.5% for methods including Extreme Gradient Boosting, CatBoost, and Random Forest. The method reduces the number of features needed without loss of performance, and it helps both to classify the primary organ and to uncover subgroups within specific cancers such as breast and lung.
Using a gradient boosting feature selector shows potential for streamlining methylation-based cancer classification.