MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK.
Cambridge Institute of Therapeutic Immunology & Infectious Disease, University of Cambridge, Cambridge CB2 0AW, UK.
Bioinformatics. 2020 Sep 15;36(18):4789-4796. doi: 10.1093/bioinformatics/btaa593.
Diverse applications, particularly in tumour subtyping, have demonstrated the importance of integrative clustering techniques for combining information from multiple data sources. Cluster Of Clusters Analysis (COCA) is one such approach that has been widely applied in the context of tumour subtyping. However, the properties of COCA have never been systematically explored, and its robustness to the inclusion of noisy datasets is unclear.
We rigorously benchmark COCA, and present Kernel Learning Integrative Clustering (KLIC) as an alternative strategy. KLIC frames the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering. This allows the contribution of noisy datasets to be down-weighted relative to more informative datasets. We compare the performances of KLIC and COCA in a variety of situations through simulation studies. We also present the output of KLIC and COCA in real data applications to cancer subtyping and transcriptional module discovery.
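The idea of a weighted contribution from each dataset can be illustrated with a minimal sketch. This is not the klic package's implementation: the function names, the toy cluster labels and the hand-picked weights are all assumptions made for illustration, and the weights are fixed by hand here rather than learned. Each dataset's clustering is turned into a co-clustering kernel (entry 1 when two samples share a cluster, 0 otherwise), and the kernels are then combined as a convex weighted sum.

```python
import numpy as np


def coclustering_kernel(labels):
    """Return an n x n matrix with 1 where two samples share a cluster label.

    The matrix equals Z @ Z.T for a one-hot label matrix Z, so it is
    symmetric and positive semi-definite, i.e. a valid kernel.
    """
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def combine_kernels(kernels, weights):
    """Combine kernels as a convex weighted sum (weights rescaled to sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))


# Hypothetical toy data: two datasets agree on the grouping, one is noisy.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 0, 1, 1, 1]
labels_noisy = [0, 1, 0, 1, 0, 1]

kernels = [coclustering_kernel(l) for l in (labels_a, labels_b, labels_noisy)]

# Equal weights versus a hand-chosen down-weighting of the noisy dataset.
equal = combine_kernels(kernels, [1.0, 1.0, 1.0])
downweighted = combine_kernels(kernels, [1.0, 1.0, 0.1])

# Similarity of samples 0 and 1 (same true group) and 0 and 4 (different groups).
print(equal[0, 1], downweighted[0, 1])
print(equal[0, 4], downweighted[0, 4])
```

With the noisy dataset down-weighted, the combined similarity rises for pairs that the informative datasets place together and falls for pairs that only the noisy dataset places together. A final clustering would then be obtained by applying a kernel-based clustering method to the combined matrix.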
R packages klic and coca are available on the Comprehensive R Archive Network.
Supplementary data are available at Bioinformatics online.