Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations.

Author affiliations

Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy.

Center for Molecular Diagnostics and Advanced Therapies, University of L'Aquila, Via Petrini, 67100, L'Aquila, Italy.

Publication information

J Transl Med. 2023 Nov 21;21(1):836. doi: 10.1186/s12967-023-04720-4.

Abstract

BACKGROUND

Machine learning (ML) is a powerful tool for capturing relationships between molecular alterations and cancer types and for extracting biological information. Here, we developed a plain ML model that distinguishes cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin.

METHODS

TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioPortal. A data transformation technique of the vector space model type was designed to build consistently homogeneous new datasets containing, as predictive features, calls for somatic point mutations and arm-level chromosomal copy number variations, thus allowing the use of XGBoost classifier models. Because the dataset is imbalanced, owing to large differences in the number of cases per tumor type, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove less represented cancer types, and ii) dividing cancer types into groups based on biological criteria and training a specific XGBoost model for each group. The performance of all trained models was assessed mainly by the out-of-sample balanced accuracy (BACC) and AUC scores.
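The abstract does not spell out the transformation, so the following is only a minimal sketch of one way a vector-space encoding and a percentage cut-off could look. The toy samples, the `mut:`/`cnv:` feature naming, the +1/-1/0 coding of arm-level gains and losses, and the helper names `filter_by_share`, `build_vocabulary` and `encode` are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy samples (NOT real TCGA records): each lists the mutated
# genes and the arm-level copy number calls (+1 gain, -1 loss; assumed coding).
samples = [
    {"id": "S1", "type": "BRCA", "mutations": {"TP53", "PIK3CA"}, "cnv": {"1q": 1, "8p": -1}},
    {"id": "S2", "type": "BRCA", "mutations": {"PIK3CA"}, "cnv": {"1q": 1}},
    {"id": "S3", "type": "LUAD", "mutations": {"KRAS", "TP53"}, "cnv": {}},
    {"id": "S4", "type": "LUAD", "mutations": {"KRAS"}, "cnv": {"8p": -1}},
    {"id": "S5", "type": "UVM", "mutations": {"GNAQ"}, "cnv": {"3p": -1}},
]

def filter_by_share(samples, min_share):
    """Strategy i): drop cancer types whose share of all samples is below min_share."""
    counts = Counter(s["type"] for s in samples)
    total = len(samples)
    keep = {t for t, n in counts.items() if n / total >= min_share}
    return [s for s in samples if s["type"] in keep]

def build_vocabulary(samples):
    """Fix one column per observed gene and per observed chromosome arm."""
    genes = sorted({g for s in samples for g in s["mutations"]})
    arms = sorted({a for s in samples for a in s["cnv"]})
    return [f"mut:{g}" for g in genes] + [f"cnv:{a}" for a in arms]

def encode(sample, vocabulary):
    """Map one sample onto the shared vocabulary as a fixed-length vector."""
    row = []
    for feature in vocabulary:
        kind, name = feature.split(":", 1)
        if kind == "mut":
            row.append(1 if name in sample["mutations"] else 0)
        else:
            row.append(sample["cnv"].get(name, 0))  # 0 = no arm-level call
    return row

kept = filter_by_share(samples, min_share=0.3)   # UVM (1 of 5 samples) is dropped
vocabulary = build_vocabulary(kept)
X = [encode(s, vocabulary) for s in kept]        # homogeneous feature matrix
y = [s["type"] for s in kept]                    # class labels
```

Every row of `X` has the same length and column meaning, which is what lets a standard tabular classifier such as XGBoost be trained on it; `X` and `y` could be passed to such a model after encoding the labels as integers.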

RESULTS

The XGBoost classifier achieved its best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, when the 18 most represented cancers were divided into three groups (endocrine-related carcinomas, other carcinomas and other cancers), the group-specific models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable to or higher than that of other models already described in the literature and based on similar molecular data and ML approaches.
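The BACC figures are easier to interpret with the metric's usual definition in mind: the unweighted mean of per-class recall, which stops large classes from dominating the score. The sketch below assumes that standard definition (the abstract does not state the exact formula used) and runs it on made-up labels to show how BACC can diverge from plain accuracy on an imbalanced set.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (the usual BACC definition)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Made-up imbalanced example: 8 samples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]   # one of the two "B" samples is missed

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.2f} balanced_accuracy={bacc:.2f}")
```

Here plain accuracy is 9/10 while the balanced score is (8/8 + 1/2) / 2, so the missed minority-class sample costs far more under BACC.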

CONCLUSIONS

A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets that are simpler and more interpretable than the original data while retaining enough information to accurately train standard ML models without resorting to sophisticated deep learning architectures. In combination with histopathological examination, this approach could improve cancer diagnosis by using specific DNA alterations processed by a replicable and easy-to-use automated technology. The study encourages new investigations that could further increase the classifier's performance, for example by considering more features and dividing tumors into their main molecular subtypes.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f70a/10664515/293826661f3c/12967_2023_4720_Fig1_HTML.jpg
