Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Oslo, Norway.
Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway.
Bioinformatics. 2019 Dec 1;35(23):4886-4897. doi: 10.1093/bioinformatics/btz381.
Unsupervised clustering is important in disease subtyping, among other genomic applications. As genomic data have become more multifaceted, how to cluster across data sources for more precise subtyping has become an increasingly important area of research. Many of the methods proposed so far, including iCluster and Cluster of Cluster Assignments (COCAs), make an unreasonable assumption of a common clustering across all data sources; those that do not are fewer and tend to be computationally intensive.
We propose a Bayesian parametric model for integrative, unsupervised clustering across data sources. In our two-way latent structure model, samples are clustered in relation to each specific data source, which distinguishes the model from methods like COCAs and iCluster, but cluster labels have across-dataset meaning, allowing cluster information to be shared between data sources. A common scaling across data sources is not required. Inference is obtained by a Gibbs sampler, which we improve with a warm start strategy and modified density functions to make convergence faster and more robust. Posterior interpretation allows for inference on common clusterings occurring among subsets of data sources. An interesting statistical formulation of the model results in sampling from closed-form posteriors despite the incorporation of a complex latent structure. We fit the model with Gaussian and more general densities, a choice that influences the degree of across-dataset cluster label sharing. Uniquely among integrative clustering models, our formulation makes no nestedness assumptions of samples across data sources, so a sample missing data from one genomic source can be clustered according to its existing data sources. We apply our model to a Norwegian breast cancer cohort of ductal carcinoma in situ and invasive tumors, comprising somatic copy-number alteration (CNA), methylation and expression datasets. We find enrichment in the Her2 subtype and ductal carcinoma among those observations exhibiting greater cluster correspondence across expression and CNA data. In general, there are few pan-genomic clusterings, suggesting that models assuming a common clustering across genomic data sources might yield misleading results.
The model is implemented in an R package called twl ('two-way latent'), available on CRAN. Data for analysis are available within the R package.
Supplementary data are available at Bioinformatics online.
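To illustrate what Gibbs sampling from closed-form conditional posteriors looks like in the simplest setting, the following is a minimal Python sketch for a one-dimensional Gaussian mixture on a single data source. It is not the two-way latent model and does not use the twl package. The function name, the known shared standard deviation, the zero-mean Normal prior on cluster means, the symmetric Dirichlet prior on weights and the quantile-based warm start are all assumptions made for illustration.

```python
import math
import random


def gibbs_gaussian_mixture(x, k=2, n_iter=200, sigma=1.0,
                           prior_sd=10.0, alpha=1.0, seed=0):
    """Illustrative Gibbs sampler: 1-D Gaussian mixture, known shared sigma.

    Assumed priors (not taken from the paper): mu_j ~ Normal(0, prior_sd^2),
    weights ~ Dirichlet(alpha, ..., alpha).
    """
    rng = random.Random(seed)
    n = len(x)
    # Crude warm start (an assumption): spread initial means over data quantiles.
    xs = sorted(x)
    mu = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    weights = [1.0 / k] * k
    z = [0] * n
    for _ in range(n_iter):
        # 1) Labels: categorical, probability proportional to weight * likelihood.
        for i, xi in enumerate(x):
            logp = [math.log(weights[j]) - 0.5 * ((xi - mu[j]) / sigma) ** 2
                    for j in range(k)]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            z[i] = rng.choices(range(k), weights=p)[0]
        # 2) Means: conjugate Normal posterior given the points in each cluster.
        for j in range(k):
            members = [xi for xi, zi in zip(x, z) if zi == j]
            precision = 1.0 / prior_sd ** 2 + len(members) / sigma ** 2
            post_mean = (sum(members) / sigma ** 2) / precision
            mu[j] = rng.gauss(post_mean, math.sqrt(1.0 / precision))
        # 3) Weights: Dirichlet posterior, drawn as normalised Gamma variates.
        g = [rng.gammavariate(alpha + z.count(j), 1.0) for j in range(k)]
        total = sum(g)
        weights = [v / total for v in g]
    # Returns the state of the final sweep only, not posterior summaries.
    return z, mu, weights
```

Each of the three steps draws from a standard distribution, which is the sense in which the conditionals are closed-form. In a multi-source setting, a sampler of this kind would be run with additional structure linking the labels across data sources; that linkage is specific to the published model and is not reproduced here.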