Žurauskienė Justina, Kirk Paul D W, Stumpf Michael P H
Stat Appl Genet Mol Biol. 2016 Apr;15(2):107-22. doi: 10.1515/sagmb-2016-0016.
The rapid development of high-throughput experimental techniques has produced a growing diversity of genomic datasets that require analysis. It is increasingly recognized that we can gain a deeper understanding of the underlying biology by combining the insights obtained from multiple, diverse datasets. We therefore propose a novel, scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset, or (for fast, approximate, massive-scale data fusion) switch naturally to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (in parallel) before a fast post-processing step performs the data integration. This allows us to incorporate new experimental data in an online fashion without having to rerun the entire analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, which include yeast cell cycle, breast cancer, and sporadic inclusion body myositis datasets.
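The workflow outlined in the abstract (model each dataset independently, represent each result as a network, then integrate in a fast post-processing step) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each dataset has already been clustered separately, that all datasets describe the same items, and that "similarity among datasets" can be approximated by keeping only co-clustering edges present in every dataset. The function names, the hard intersection rule, and the example labels are hypothetical choices made for illustration only.

```python
from itertools import combinations


def coclustering_edges(labels):
    """Network view of one clustering (assumed representation):
    an edge joins every pair of items that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def fuse(edge_sets):
    """Post-processing step (illustrative rule, not from the paper):
    keep only the edges supported by every dataset."""
    edge_sets = list(edge_sets)
    fused = set(edge_sets[0])
    for edges in edge_sets[1:]:
        fused &= edges
    return fused


def connected_components(n_items, edges):
    """Group items that are linked in a network, using union-find."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Hypothetical cluster labels for the same six items in two datasets,
# each obtained independently (so this step could run in parallel).
expression_labels = [0, 0, 0, 1, 1, 1]
binding_labels = [0, 0, 1, 1, 2, 2]

networks = [coclustering_edges(expression_labels),
            coclustering_edges(binding_labels)]
fused = fuse(networks)
fused_groups = connected_components(6, fused)

# Online update: a third (hypothetical) dataset arrives later. Only its
# own network is built; the earlier datasets are not re-analysed.
new_labels = [0, 1, 1, 1, 2, 2]
updated = fused & coclustering_edges(new_labels)
updated_groups = connected_components(6, updated)
```

With these example labels, items 0 and 1 and items 4 and 5 remain grouped after fusing the first two datasets, and only items 4 and 5 remain grouped once the third dataset is added. In practice a softer agreement score could replace the strict intersection, and the per-dataset labels could come from a Bayesian nonparametric model or from a faster heuristic clustering method, the two options the abstract mentions.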