School of Life Sciences, Shanghai University, Shanghai 200444, China.
Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Biomed Res Int. 2022 Dec 28;2022:5297235. doi: 10.1155/2022/5297235. eCollection 2022.
Sarcoma, the second most common type of solid tumor in children and adolescents, has a wide variety of subtypes that are often not correctly diagnosed at an early stage, leading to late metastases and to serious loss of life and property for patients and their families. It exhibits a high degree of heterogeneity at the cellular, molecular, and epigenetic levels, and DNA methylation has been proposed to play a role in the diagnosis of sarcoma subtypes. This study therefore aimed to identify potential biomarkers at the DNA methylation level that distinguish different sarcoma subtypes. A machine learning workflow was designed to analyse sarcoma samples, each of which was represented by a large number of methylation sites. Irrelevant sites were removed using the Boruta method, and the remaining sites, which were related to the target variable, were kept for further analysis. Three feature ranking methods (LASSO, LightGBM, and MCFS) were then used to rank these features, and six classification models were constructed by combining incremental feature selection with two classification algorithms, decision tree (DT) and random forest (RF). The RF models outperformed the DT models under all three rankings. The expression of genes annotated to the top-ranked methylation sites, such as PRKAR1B, INPP5A, and GLI3, has been linked to sarcoma in published studies. Moreover, the quantitative rules produced by the decision tree algorithm helped us to understand the essential differences between sarcoma types and to classify sarcoma subtypes, providing a new means of clinical identification and pointing to new therapeutic targets.
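The incremental feature selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the subtype labels and the feature ranking are hypothetical, and a simple nearest-centroid classifier with leave-one-out evaluation stands in for the decision tree and random forest models so that the example needs only the Python standard library. The idea it shows is the one named in the abstract: take a ranked feature list, evaluate a classifier on the top-k features for increasing k, and keep the k with the best score.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x.

    Stand-in classifier (assumption): the study used decision tree and
    random forest models, not nearest centroid.
    """
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # sorted() makes tie-breaking between labels deterministic
    return min(sorted(centroids), key=lambda lab: dist2(centroids[lab], x))


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    sub = [[row[j] for j in features] for row in X]
    hits = 0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, sub[i]) == y[i]
    return hits / len(sub)


def incremental_feature_selection(X, y, ranked_features, step=1):
    """Score the top-k subsets of a ranked feature list.

    Returns (best_k, best_score, curve); ties go to the smallest k.
    """
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        curve.append((k, loo_accuracy(X, y, ranked_features[:k])))
    best_k, best_score = max(curve, key=lambda t: (t[1], -t[0]))
    return best_k, best_score, curve


# Synthetic data (assumption): 3 hypothetical subtypes, 10 samples each,
# 6 "methylation sites"; sites 0 and 1 are informative, sites 2-5 are noise.
rng = random.Random(0)
X, y = [], []
for c, label in enumerate(["subtype_A", "subtype_B", "subtype_C"]):
    for _ in range(10):
        row = [c * 5 + rng.gauss(0, 0.5), -c * 5 + rng.gauss(0, 0.5)]
        row += [rng.gauss(0, 1) for _ in range(4)]
        X.append(row)
        y.append(label)

# Hypothetical ranking, as if produced by LASSO, LightGBM, or MCFS.
ranked_features = [0, 1, 2, 3, 4, 5]

best_k, best_score, curve = incremental_feature_selection(X, y, ranked_features)
for k, score in curve:
    print(f"top {k} features: leave-one-out accuracy = {score:.3f}")
print(f"selected k = {best_k} (accuracy {best_score:.3f})")
```

In a setting closer to the study, `loo_accuracy` would be replaced by cross-validated evaluation of a decision tree or random forest, and the procedure would be repeated once per ranking method, which is how three rankings and two classifiers yield six models.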