Development and validation of a reliable DNA copy-number-based machine learning algorithm (CopyClust) for breast cancer integrative cluster classification.

Affiliations

Cancer Research UK Cambridge Institute and Department of Oncology, Li Ka Shing Centre, University of Cambridge, Cambridge, UK.

Harvard Medical School, Boston, MA, USA.

Publication information

Sci Rep. 2024 May 24;14(1):11861. doi: 10.1038/s41598-024-62724-6.

Abstract

The Integrative Cluster subtypes (IntClusts) provide a framework for classifying breast cancer tumors into 10 distinct groups based on copy number and gene expression, each with unique biological drivers of disease and clinical prognoses. Gene expression data are often lacking, so accurate classification of samples into IntClusts from copy number data alone is essential. Current classification methods achieve low accuracy when gene expression data are absent, warranting the development of new approaches to IntClust classification. Copy number data from 1980 breast cancer samples from METABRIC were used to train multiclass XGBoost machine learning algorithms (CopyClust). A piecewise constant fit was applied to the average copy number profile of each IntClust, and the unique breakpoints across the 10 profiles were identified and converted into ~500 genomic regions used as features for CopyClust. Two modeling approaches were used: a 10-class model, in which a single multiclass model predicts the final IntClust label, and a 6-class model with binary reclassification, in which four pairs of IntClusts were combined for initial multiclass classification. Performance was validated on the TCGA dataset, with copy number data generated from both SNP array and WES platforms. CopyClust achieved 81% and 79% overall accuracy on the TCGA SNP and WES datasets, respectively, a nine-percentage-point or greater improvement in overall IntClust subtype classification accuracy. CopyClust achieves a significant improvement over current methods in classification accuracy of IntClust subtypes for samples without available gene expression data and is an easily implementable algorithm for IntClust classification of breast cancer samples with copy number data.
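The feature-construction step described in the abstract (breakpoints pooled across the per-class profiles, then turned into genomic regions) can be illustrated with a minimal Python sketch. This is not the CopyClust implementation. It assumes each class profile has already been fitted to a piecewise-constant form on a shared grid of genomic bins, it ignores chromosome boundaries, and it assumes a region's feature value is the mean of a sample's copy number over that region. The function names `breakpoints_of`, `regions_from_profiles`, and `region_features` and the toy data are hypothetical.

```python
import numpy as np


def breakpoints_of(profile, tol=1e-9):
    """Return bin indices i where a piecewise-constant profile
    changes value between bin i-1 and bin i."""
    return np.flatnonzero(np.abs(np.diff(profile)) > tol) + 1


def regions_from_profiles(profiles):
    """Pool the breakpoints of all class profiles and convert them
    into half-open [start, end) bin regions covering the whole grid."""
    n_bins = profiles.shape[1]
    cuts = set()
    for profile in profiles:
        cuts.update(breakpoints_of(profile).tolist())
    edges = [0] + sorted(cuts) + [n_bins]
    return list(zip(edges[:-1], edges[1:]))


def region_features(sample, regions):
    """Summarize one sample as its mean copy number per region
    (the choice of the mean is an assumption of this sketch)."""
    return np.array([sample[start:end].mean() for start, end in regions])


# Toy data: two already-segmented class profiles over 8 bins.
profiles = np.array([
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 2, 2, 0, 0],
], dtype=float)

regions = regions_from_profiles(profiles)

# A hypothetical sample measured on the same 8 bins.
sample = np.array([0.1, -0.1, 1.0, 1.2, 0.9, 0.0, 0.1, -0.1])
features = region_features(sample, regions)

print(regions)
print(features)
```

In a real pipeline, the resulting feature matrix (one row per sample, one column per region) would be the input to a multiclass classifier such as XGBoost. The sketch stops before that step because the abstract does not specify the model configuration.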

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a40a/11126405/7734f4da8d30/41598_2024_62724_Fig1_HTML.jpg
