Finite mixtures of matrix variate Poisson-log normal distributions for three-way count data.

Author information

Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada.

Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON N1G 2W1, Canada.

Publication information

Bioinformatics. 2023 May 4;39(5). doi: 10.1093/bioinformatics/btad167.

Abstract

MOTIVATION

Three-way data structures, characterized by three entities (the units, the variables, and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures arise when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster such data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks.

RESULTS

In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo-based approach, a variational Gaussian approximation-based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.
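To make the claim about fewer covariance parameters concrete, the following is a minimal Python sketch, not the mixMVPLN implementation. It assumes each gene is an r x p count matrix (occasions by conditions) whose latent log-means follow a matrix normal distribution with mean M, an r x r row covariance Phi, and a p x p column covariance Omega. The function names and the toy values are hypothetical. Library-size normalization and the mixture over clusters are left out, so the sketch covers a single component only.

```python
import numpy as np


def n_cov_params_unstructured(r, p):
    """Free parameters in an unstructured (r*p) x (r*p) covariance matrix."""
    d = r * p
    return d * (d + 1) // 2


def n_cov_params_matrix_variate(r, p):
    """Free parameters in separate r x r and p x p covariance matrices.

    This count ignores the scale constraint needed for identifiability
    (Phi and Omega are only determined up to a multiplicative constant).
    """
    return r * (r + 1) // 2 + p * (p + 1) // 2


def sample_mvpln(n, M, Phi, Omega, rng):
    """Draw n count matrices from one matrix variate Poisson-log normal component.

    Assumed model (illustrative): theta ~ MatrixNormal(M, Phi, Omega) and
    Y[i, j] | theta ~ Poisson(exp(theta[i, j])).
    """
    r, p = M.shape
    A = np.linalg.cholesky(Phi)    # Phi = A A^T (row covariance)
    B = np.linalg.cholesky(Omega)  # Omega = B B^T (column covariance)
    Z = rng.standard_normal((n, r, p))
    theta = M + A @ Z @ B.T        # matrix normal draws, shape (n, r, p)
    return rng.poisson(np.exp(theta))


r, p = 3, 4  # toy sizes: 3 occasions, 4 conditions
print(n_cov_params_unstructured(r, p))    # 78
print(n_cov_params_matrix_variate(r, p))  # 16

rng = np.random.default_rng(0)
Y = sample_mvpln(5, np.full((r, p), 2.0), 0.1 * np.eye(r), 0.1 * np.eye(p), rng)
print(Y.shape)  # (5, 3, 4)
```

With these toy sizes, the separable structure needs 16 covariance parameters where an unstructured covariance over all 12 entries would need 78, and the gap widens as r and p grow.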

AVAILABILITY AND IMPLEMENTATION

The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.
