Zarean Elaheh, Li Shuai, Wong Ee Ming, Makalic Enes, Milne Roger L, Giles Graham G, McLean Catriona, Southey Melissa C, Dugué Pierre-Antoine
Precision Medicine, School of Clinical Sciences at Monash Health, Monash University, Clayton, VIC, Australia.
Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, Australia.
Epigenomics. 2025 Feb;17(2):105-114. doi: 10.1080/17501911.2024.2441653. Epub 2024 Dec 23.
AIMS: Clustering algorithms have been widely applied to tumor DNA methylation datasets to define methylation-based cancer subtypes. This study aimed to evaluate the agreement between subtypes obtained from common clustering strategies.
MATERIALS & METHODS: We used tumor DNA methylation data from 409 women with breast cancer from the Melbourne Collaborative Cohort Study (MCCS) and 781 breast tumors from The Cancer Genome Atlas (TCGA). Agreement was assessed using the adjusted Rand index for various combinations of number of CpGs, number of clusters and clustering algorithms (hierarchical, K-means, partitioning around medoids, and recursively partitioned mixture models).
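As an illustration of the agreement measure named above, the following is a minimal pure-Python sketch of the adjusted Rand index (ARI) for two cluster assignments of the same samples. It is not the study's code: the function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example, and the formula is the standard pair-counting definition of the ARI.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples.

    Hypothetical helper for illustration; uses the standard pair-counting
    (contingency table) formula.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must cover the same samples")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions agree trivially

    # Pairs of samples placed together in both partitions.
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())

    # Pairs of samples placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement of identical partitions
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy assignments for six tumors (made-up labels, not study data).
two_clusters = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]      # same grouping, different label names
three_clusters = [0, 0, 1, 1, 2, 2]  # a different grouping with K = 3

print(adjusted_rand_index(two_clusters, relabelled))
print(adjusted_rand_index(two_clusters, three_clusters))
```

The ARI depends only on which samples are grouped together, not on the label names, so the relabelled assignment scores 1.0, while the K = 3 assignment scores well below the 0.7 level.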
RESULTS: Inconsistent agreement patterns were observed for between-algorithm and within-algorithm comparisons, with generally poor to moderate agreement (ARI < 0.7). Results were qualitatively similar in the MCCS and TCGA, showing better agreement for a moderate number of CpGs and fewer clusters (K = 2). Restricting the analysis to CpGs that were differentially methylated between tumor and normal tissue did not result in higher agreement.
CONCLUSION: Our study highlights that common clustering strategies involving an arbitrary choice of algorithm, number of clusters and number of methylation sites are likely to identify different DNA methylation-based breast tumor subtypes.