Žitnik Marinka, Zupan Blaž
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
Bioinformatics. 2015 Jun 15;31(12):i230-9. doi: 10.1093/bioinformatics/btv258.
Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as those from next-generation sequencing, often violate this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.
We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.
Source code is at https://github.com/marinkaz/fusenet.
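The idea of representing model parameters through shared latent factors can be illustrated with a small sketch. This is not the FuseNet implementation from the repository above. It assumes, purely for illustration, that each dataset's edge-weight matrix is the product of a dataset-specific factor and one factor shared across datasets. The names `edge_weights`, `neighborhoods`, `U`, `V`, and the dataset labels are hypothetical, and the factors are random rather than fitted to data.

```python
import numpy as np


def edge_weights(U_k, V):
    """Edge-weight matrix for one dataset under an assumed low-rank form.

    U_k: (p, r) dataset-specific factor; V: (p, r) factor shared by all datasets.
    """
    theta = U_k @ V.T                # p x p matrix of rank at most r
    theta = 0.5 * (theta + theta.T)  # symmetrize, since the network is undirected
    np.fill_diagonal(theta, 0.0)     # no self-edges
    return theta


def neighborhoods(theta, k=3):
    """For each node, return the indices of its k strongest-weighted other nodes."""
    scores = np.abs(theta)
    np.fill_diagonal(scores, -np.inf)  # exclude each node from its own neighborhood
    return np.argsort(-scores, axis=1)[:, :k]


rng = np.random.default_rng(0)
p, r = 6, 2                          # 6 nodes (genes), 2 latent factors

V = rng.normal(size=(p, r))          # shared across datasets
U = {                                # one factor per dataset (hypothetical labels)
    "rna_seq": rng.normal(size=(p, r)),
    "mutation": rng.normal(size=(p, r)),
}

# Each dataset gets its own network, but all networks are tied through V.
thetas = {name: edge_weights(U_k, V) for name, U_k in U.items()}

for name, theta in thetas.items():
    print(name, neighborhoods(theta, k=3)[0])  # neighbors of node 0
```

In this sketch each dataset contributes p × r dataset-specific parameters plus the p × r shared ones, instead of a full p × p matrix per dataset. That reduction is one reason a factorized formulation could be computationally efficient. Fitting the factors to data drawn from an exponential-family distribution is outside the scope of this example.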