Žitnik Marinka, Nam Edward A, Dinh Christopher, Kuspa Adam, Shaulsky Gad, Zupan Blaž
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.
PLoS Comput Biol. 2015 Oct 14;11(10):e1004552. doi: 10.1371/journal.pcbi.1004552. eCollection 2015 Oct.
Data integration procedures combine heterogeneous data sets into predictive models, but they are limited to data explicitly related to the target object type, such as genes. Collage is a new data fusion approach to gene prioritization. It considers data sets of various association levels with the prediction task, utilizes collective matrix factorization to compress the data, and chaining to relate different object types contained in a data compendium. Collage prioritizes genes based on their similarity to several seed genes. We tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions. Using 4 seed genes and 14 data sets, only one of which was directly related to the bacterial response, Collage proposed 8 candidate genes that were readily validated as necessary for the response of Dictyostelium to Gram-negative bacteria. These findings establish Collage as a method for inferring biological knowledge from the integration of heterogeneous and coarsely related data sets.
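The three ingredients named in the abstract (compressing the data, chaining relations between object types, and ranking genes by similarity to seed genes) can be illustrated with a minimal sketch. This is not the Collage implementation: a truncated SVD stands in for collective matrix factorization, chaining is reduced to a plain product of relation matrices, and the toy matrices, the function names (`chain`, `latent_profiles`, `prioritize`), and the choice of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def chain(*relations):
    """Relate the first object type to the last by multiplying relation matrices."""
    out = relations[0]
    for rel in relations[1:]:
        out = out @ rel
    return out


def latent_profiles(matrix, rank):
    """Compress each row into `rank` latent dimensions (truncated SVD stand-in)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]


def prioritize(profiles, seed_idx):
    """Rank non-seed rows by mean cosine similarity to the seed rows."""
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.where(norms == 0, 1.0, norms)
    scores = (unit @ unit[seed_idx].T).mean(axis=1)
    seeds = set(seed_idx)
    ranking = [int(i) for i in np.argsort(-scores) if int(i) not in seeds]
    return ranking, scores


# Hypothetical toy data: 6 genes x 4 annotation terms, 4 terms x 3 conditions.
gene_term = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
term_condition = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

# Chaining gives genes an indirect profile over conditions via the terms.
gene_condition = chain(gene_term, term_condition)

# Fuse the direct and the chained views, then compress to 2 latent dimensions.
fused = np.hstack([gene_term, gene_condition])
profiles = latent_profiles(fused, rank=2)

# Genes 0 and 1 act as seeds; the remaining genes are ranked as candidates.
ranking, scores = prioritize(profiles, seed_idx=[0, 1])
print(ranking)
```

In this toy setting gene 2 shares an annotation term with both seeds and therefore ranks first, while genes 3 to 5 have no terms in common with the seeds and score near zero. The point is only that a data set not directly indexed by genes (terms by conditions) can still contribute to the ranking once it is chained to a gene-level relation.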