GLIO-Select: Machine Learning-Based Feature Selection and Weighting of Tissue and Serum Proteomic and Metabolomic Data Uncovers Sex Differences in Glioblastoma.

Author information

Tasci Erdal, Chappidi Shreya, Zhuge Ying, Zhang Longze, Cooley Zgela Theresa, Sproull Mary, Mackey Megan, Camphausen Kevin, Krauze Andra Valentina

Affiliation

Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, Bethesda, MD 20892, USA.

Publication information

Int J Mol Sci. 2025 May 2;26(9):4339. doi: 10.3390/ijms26094339.

Abstract

Glioblastoma (GBM) is a fatal brain cancer known for its rapid and aggressive growth, with some studies indicating that females may have better survival outcomes than males. While sex differences in GBM have been observed, the underlying biological mechanisms remain poorly understood. Feature selection can identify discriminative key biomarkers by reducing the dimensionality of high-dimensional medical datasets, which improves machine learning model performance, explainability, and interpretability. It can also uncover unique sex-specific biomarkers, determinants, and molecular profiles in patients with GBM.

We analyzed high-dimensional proteomic and metabolomic profiles from serum biospecimens obtained from 109 patients with pathology-proven GBM on NIH IRB-approved protocols with full clinical annotation (local dataset). Serum proteomic analysis was performed using Somalogic aptamer-based technology (measuring 7289 proteins), and serum metabolomic analysis was performed using the University of Florida's SECIM (Southeast Center for Integrated Metabolomics) platform (measuring 6015 metabolites). Machine learning-based feature selection was employed to identify proteins and metabolites associated with male and female labels in high-dimensional datasets. Results were compared to publicly available proteomic and metabolomic datasets (CPTAC and TCGA) using the same methodology and TCGA data previously structured for glioma grading. Using a hybrid machine learning-based feature selection approach that combines LASSO and mRMR with a rank-based weighting method (i.e., GLIO-Select), we linked proteomic and metabolomic data to clinical data for feature reduction, to identify molecular biomarkers associated with biological sex in patients with GBM, and used a separate TCGA set to explore possible linkages between biological sex and mutations associated with tumor grading.
Serum proteomic and metabolomic data identified several hundred features that were associated with the male/female class label in the GBM datasets. Using the local serum-based dataset of 109 patients, 17 features (100% accuracy [ACC]) and 16 features (92% ACC) were identified for the proteomic and metabolomic datasets, respectively. Using the CPTAC tissue-based dataset (8828 proteomic and 59 metabolomic features), 5 features (99% ACC) and 13 features (80% ACC) were identified for the proteomic and metabolomic datasets, respectively. Proteomic data, whether from serum or tissue (CPTAC), achieved the highest accuracy rates (100% and 99%, respectively), followed by the serum metabolome and the tissue metabolome. The local serum data yielded several clinically known features (PSA, PZP, HCG, and FSH) that were distinct from those in the CPTAC tissue data (RPS4Y1 and DDX3Y), both providing methodological validation, with PZP and defensins (DEFA3 and DEFB4A) representing shared proteomic features between serum and tissue. Metabolomic features shared between serum and tissue were homocysteine and pantothenic acid. Several signals emerged that are known to be associated with glioma or GBM but were not previously known to be associated with biological sex, requiring further research, as well as several novel signals not previously linked to either biological sex or glioma. EGFR, FAT4, and BCOR were the three features selected from the TCGA glioma grading set, with 64% ACC. GLIO-Select substantially reduced feature dimensionality across the different types of datasets analyzed (e.g., serum and tissue-based). The proposed approach reduced the relevant features to fewer than twenty biomarkers for each GBM dataset. Serum biospecimens appear to be highly effective for identifying biologically relevant sex differences in GBM.
These findings suggest that noninvasive serum-based analyses may provide more accurate and clinically detailed insights into sex as a biological variable (SABV) than other biospecimens, with several signals linking sex differences and glioma pathology via immune response, amino acid metabolism, and cancer hallmark signals that require further research. Our results underscore the importance of biospecimen choice and feature selection in enhancing the interpretation of omics data for understanding sex-based differences in GBM. This discovery holds significant potential for enhancing personalized treatment plans and patient outcomes.
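
The abstract describes GLIO-Select only at a high level: two selectors (LASSO and mRMR) whose outputs are combined through rank-based weighting. The sketch below illustrates that general idea on synthetic data and is not the authors' implementation. Several pieces are assumptions made for illustration: the mRMR step uses absolute Pearson correlation in place of mutual information, LASSO is fitted by plain coordinate descent on a 0/1 label, the weighting rule `(k - position) / k` is a hypothetical Borda-style choice, and the feature names `f0` to `f7` and all function names are invented.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def mrmr_order(columns, y):
    """Greedy mRMR-style ordering (assumption: correlation replaces mutual
    information). Score = relevance to y minus mean redundancy with picks."""
    relevance = {f: abs(pearson(col, y)) for f, col in columns.items()}
    selected, remaining = [], list(columns)
    while remaining:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = mean(abs(pearson(columns[f], columns[s])) for s in selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def soft_threshold(value, alpha):
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_order(columns, y, alpha=0.05, sweeps=100):
    """Coordinate-descent LASSO on standardized features; returns features
    with nonzero coefficients, largest absolute coefficient first."""
    n = len(y)
    standardized = {}
    for f, col in columns.items():
        m = mean(col)
        sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5 or 1.0
        standardized[f] = [(v - m) / sd for v in col]
    y_mean = mean(y)
    residual = [v - y_mean for v in y]
    beta = dict.fromkeys(columns, 0.0)
    for _ in range(sweeps):
        for f, x in standardized.items():
            # Correlation of the feature with the partial residual.
            rho = sum(a * r for a, r in zip(x, residual)) / n + beta[f]
            updated = soft_threshold(rho, alpha)
            delta = updated - beta[f]
            if delta:
                residual = [r - a * delta for r, a in zip(residual, x)]
                beta[f] = updated
    kept = [f for f in columns if beta[f] != 0.0]
    return sorted(kept, key=lambda f: -abs(beta[f]))


def rank_weighted(rankings):
    """Hypothetical rank-based weighting: position p in a list of length k
    contributes (k - p) / k; weights are summed across selectors."""
    total = {}
    for ranking in rankings:
        k = len(ranking)
        for position, f in enumerate(ranking):
            total[f] = total.get(f, 0.0) + (k - position) / k
    ordered = sorted(total, key=lambda f: (-total[f], f))
    return ordered, total


# Synthetic data (not from the study): one informative feature, seven noise.
rng = random.Random(0)
y = [i % 2 for i in range(60)]  # stand-in for a binary class label
columns = {"f0": [label + rng.gauss(0, 0.1) for label in y]}
for j in range(1, 8):
    columns[f"f{j}"] = [rng.gauss(0, 1) for _ in y]

mrmr_ranking = mrmr_order(columns, y)
lasso_ranking = lasso_order(columns, y)
ordered, total = rank_weighted([mrmr_ranking, lasso_ranking])
print(ordered[:3])
```

A feature ranked first by both selectors receives the maximum combined weight of 2.0, so agreement between the two methods is rewarded while a feature favored by only one selector can still survive. In practice a library implementation of LASSO and a mutual-information-based mRMR would replace the hand-written versions here.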

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a5a/12072282/39f3f88b3741/ijms-26-04339-g001.jpg
