Development and validation of a reliable DNA copy-number-based machine learning algorithm (CopyClust) for breast cancer integrative cluster classification.

Affiliations

Cancer Research UK Cambridge Institute and Department of Oncology, Li Ka Shing Centre, University of Cambridge, Cambridge, UK.

Harvard Medical School, Boston, MA, USA.

Publication information

Sci Rep. 2024 May 24;14(1):11861. doi: 10.1038/s41598-024-62724-6.

DOI:10.1038/s41598-024-62724-6

PMID:38789621

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11126405/

Abstract

The Integrative Cluster subtypes (IntClusts) provide a framework for the classification of breast cancer tumors into 10 distinct groups based on copy number and gene expression, each with unique biological drivers of disease and clinical prognoses. Gene expression data are often lacking, so accurate classification of samples into IntClusts from copy number data alone is essential. Current classification methods achieve low accuracy when gene expression data are absent, warranting the development of new approaches to IntClust classification. Copy number data from 1980 breast cancer samples from METABRIC were used to train multiclass XGBoost machine learning algorithms (CopyClust). A piecewise constant fit was applied to the average copy number profile of each IntClust, and the unique breakpoints across the 10 profiles were identified and converted into ~500 genomic regions used as features for CopyClust. Two modeling approaches were used: a 10-class model, in which a single multiclass model predicts the final IntClust label, and a 6-class model with binary reclassification, in which four pairs of IntClusts were combined for initial multiclass classification. Performance was validated on the TCGA dataset, with copy number data generated from both SNP array and WES platforms. CopyClust achieved 81% and 79% overall accuracy with the TCGA SNP and WES datasets, respectively, a nine-percentage-point or greater improvement in overall IntClust subtype classification accuracy. CopyClust achieves a significant improvement over current methods in classification accuracy of IntClust subtypes for samples without available gene expression data and is an easily implementable algorithm for IntClust classification of breast cancer samples with copy number data.
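The feature construction and the two-stage classification described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the published CopyClust implementation: the function names, the single-chromosome coordinate axis, the length-weighted averaging, and the placeholder class labels are all hypothetical. The abstract does not say which IntClust pairs are merged, and the XGBoost models are stood in for by plain callables.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One piecewise-constant segment: half-open interval [start, end) and its value.
# Assumption: a single coordinate axis; real data would be handled per chromosome.
Segment = Tuple[int, int, float]
Region = Tuple[int, int]


def regions_from_breakpoints(profiles: Sequence[Sequence[Segment]]) -> List[Region]:
    """Pool the unique breakpoints of all cluster profiles into adjacent regions."""
    points = sorted({p for profile in profiles for s, e, _ in profile for p in (s, e)})
    return list(zip(points[:-1], points[1:]))


def region_features(sample: Sequence[Segment], regions: Sequence[Region]) -> List[float]:
    """Length-weighted mean copy number of one sample inside each region.

    Parts of a region not covered by any sample segment contribute 0.
    """
    features = []
    for r_start, r_end in regions:
        total = 0.0
        for s, e, value in sample:
            overlap = min(e, r_end) - max(s, r_start)
            if overlap > 0:
                total += overlap * value
        features.append(total / (r_end - r_start))
    return features


def two_stage_predict(
    features: Sequence[float],
    coarse_model: Callable[[Sequence[float]], str],
    pair_models: Dict[str, Callable[[Sequence[float]], str]],
) -> str:
    """Predict a coarse class first, then split merged pairs with a binary model."""
    coarse = coarse_model(features)
    refine = pair_models.get(coarse)
    # Classes that were never merged have no binary model and pass through unchanged.
    return refine(features) if refine is not None else coarse


# Toy example with two hypothetical cluster-average profiles and one sample.
profile_1 = [(0, 100, 0.0), (100, 200, 1.0)]
profile_2 = [(0, 50, 0.5), (50, 200, 0.0)]
regions = regions_from_breakpoints([profile_1, profile_2])
sample = [(0, 75, 1.0), (75, 200, 0.0)]
features = region_features(sample, regions)

# Placeholder models: "A+B" is a merged pair, "C" an unmerged class.
coarse_model = lambda f: "A+B" if f[0] > 0.5 else "C"
pair_models = {"A+B": lambda f: "A" if f[1] > 0.25 else "B"}
label = two_stage_predict(features, coarse_model, pair_models)
```

In the 10-class variant, `pair_models` would simply be empty, so the coarse prediction is returned directly. In practice each callable would wrap a trained classifier's prediction method.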

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a40a/11126405/7734f4da8d30/41598_2024_62724_Fig1_HTML.jpg
