moCluster: Identifying Joint Patterns Across Multiple Omics Data Sets.

Author information

Meng Chen, Helm Dominic, Frejno Martin, Kuster Bernhard

Affiliations

Department of Oncology, University of Oxford, Oxford OX3 7DQ, United Kingdom.

Center for Integrated Protein Science Munich (CIPSM), Emil-Erlenmeyer-Forum 5, Freising 85354, Germany.

Publication information

J Proteome Res. 2016 Mar 4;15(3):755-65. doi: 10.1021/acs.jproteome.5b00824. Epub 2015 Dec 30.

DOI:10.1021/acs.jproteome.5b00824

PMID:26653205

Abstract

Increasingly, multiple omics approaches are being applied to understand the complexity of biological systems. Yet, computational approaches that enable the efficient integration of such data are not well developed. Here, we describe a novel algorithm, termed moCluster, which discovers joint patterns among multiple omics data. The method first employs a multiblock multivariate analysis to define a set of latent variables representing joint patterns across input data sets, which is further passed to an ordinary clustering algorithm in order to discover joint clusters. Using simulated data, we show that moCluster's performance is not compromised by issues present in iCluster/iCluster+ (notably, the nondeterministic solution) and that it operates 100× to 1000× faster than iCluster/iCluster+. We used moCluster to cluster proteomic and transcriptomic data from the NCI-60 cell line panel. The resulting cluster model revealed different phenotypes across cellular subtypes, such as doubling time and drug response. Applying moCluster to methylation, mRNA, and protein data from a large study on colorectal cancer patients identified four molecular subtypes, including one characterized by microsatellite instability and high expression of genes/proteins involved in immunity, such as PDL1, a target of multiple drugs currently in development. The other three subtypes have not been discovered before using single data sets, which clearly illustrates the molecular complexity of oncogenesis and the need for holistic, multidata analysis strategies.
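
The two-step idea in the abstract (derive latent variables shared across data sets, then hand them to an ordinary clustering algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simplified stand-in for the multiblock analysis (a singular value decomposition of the block-scaled, concatenated data) and plain k-means for the clustering step. The function names, the toy "mrna" and "protein" matrices, and all parameter values are hypothetical.

```python
import numpy as np


def joint_latent_variables(blocks, n_components=2):
    """Return sample scores on latent variables shared by all blocks.

    Each block is a (features x samples) array; all blocks share the same samples.
    Simplified stand-in for a multiblock analysis, not the published method.
    """
    scaled = []
    for block in blocks:
        centered = block - block.mean(axis=1, keepdims=True)  # center each feature
        # Scale by the Frobenius norm so that no block dominates by size alone.
        scaled.append(centered / np.linalg.norm(centered))
    stacked = np.vstack(scaled)  # (all features) x samples
    # The SVD has no random start; right singular vectors give the sample scores.
    _, singular_values, vt = np.linalg.svd(stacked, full_matrices=False)
    return (vt[:n_components] * singular_values[:n_components, None]).T


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        labels = np.argmin(
            np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1
        )
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy data (hypothetical): 20 samples in two hidden groups, measured in two blocks.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 10)
signal = np.where(truth == 0, -1.0, 1.0)
mrna = np.outer(rng.normal(size=50), signal) + 0.3 * rng.normal(size=(50, 20))
protein = np.outer(rng.normal(size=30), signal) + 0.3 * rng.normal(size=(30, 20))

scores = joint_latent_variables([mrna, protein], n_components=2)
labels = kmeans(scores, k=2)
print(labels)
```

Because both steps in this sketch avoid random initialization, repeated runs on the same input return the same cluster labels, which mirrors the determinism the abstract contrasts with iCluster/iCluster+.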
