Li Yang, Jourdain Alexis A, Calvo Sarah E, Liu Jun S, Mootha Vamsi K
Howard Hughes Medical Institute and Department of Molecular Biology and the Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States of America and Department of Systems Biology, Harvard Medical School, Boston, MA, United States of America.
Department of Statistics, Harvard University, Cambridge, MA, United States of America.
PLoS Comput Biol. 2017 Jul 18;13(7):e1005653. doi: 10.1371/journal.pcbi.1005653. eCollection 2017 Jul.
In recent years, the number of publicly available transcriptional profiling datasets has risen sharply. These massive compendia comprise billions of measurements and offer an opportunity to predict the function of unstudied genes based on co-expression with well-studied pathways. Such analyses are challenging, however, because biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges, we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to partition the input gene set into coherent co-expressed modules (CEMs) while simultaneously assigning each dataset a posterior probability of supporting each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex, which we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways and for identifying the conditions in which they are active.
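The idea of scoring a candidate gene by a per-dataset-weighted sum can be illustrated with a small sketch. This is not the CLIC model: the abstract does not give the form of the LLR, so the per-dataset term here is a stand-in (the mean Fisher z-transformed Pearson correlation between the candidate and the module genes), and the function name `weighted_coexpression_score`, the toy data, and the fixed weights are all hypothetical. In CLIC the weights would come from the posterior probability that a dataset supports the CEM; here they are simply set by hand.

```python
import numpy as np

def weighted_coexpression_score(datasets, weights, cem_idx, candidate_idx):
    """Weighted sum over datasets of a co-expression score for one candidate gene.

    datasets: list of (genes x samples) arrays sharing the same gene order.
    weights: one non-negative weight per dataset (assumed given, not inferred).
    cem_idx: row indices of the genes in the module.
    candidate_idx: row index of the gene being scored.
    The per-dataset term is a stand-in for an LLR, not the CLIC statistic.
    """
    total = 0.0
    for expr, w in zip(datasets, weights):
        corr = np.corrcoef(expr)                       # gene-by-gene Pearson correlation
        r = np.clip(corr[candidate_idx, cem_idx], -0.999, 0.999)
        total += w * np.arctanh(r).mean()              # Fisher z, averaged over module genes
    return total

# Toy compendium: genes 0-3 share a signal in the first dataset only; gene 4 is noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=30)
active = rng.normal(size=(5, 30))
active[:4] = signal + 0.1 * rng.normal(size=(4, 30))
inactive = rng.normal(size=(5, 30))

datasets = [active, inactive]
weights = [1.0, 0.0]   # hand-set: only the first dataset is treated as supporting the module

score_member = weighted_coexpression_score(datasets, weights, [0, 1, 2], 3)
score_noise = weighted_coexpression_score(datasets, weights, [0, 1, 2], 4)
print(f"co-expressed candidate: {score_member:.2f}, unrelated candidate: {score_noise:.2f}")
```

Down-weighting the second dataset keeps its noise from diluting the score, which mirrors the motivation stated in the abstract: a module may be co-expressed only in some contexts, so datasets should not contribute equally.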