Wang Zhuofan, Zhou Fangting, He Kejun, Ni Yang
The Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing 100872, China.
Department of Statistics, Texas A&M University, College Station TX 77843, USA.
Stat Interface. 2024;17(2):219-230. doi: 10.4310/23-sii790. Epub 2024 Feb 1.
Modern sequencing technologies make it possible to measure gene expression in multiple tissues from different individuals. The resulting three-way variation across genes, tissues, and individuals makes statistical inference challenging. In this paper, we propose a Bayesian multi-way clustering approach that clusters genes, tissues, and individuals simultaneously. The proposed model adaptively trichotomizes the observed data into three latent categories and uses a Bayesian hierarchical construction to further decompose the latent variables into lower-dimensional features, which can be interpreted as overlapping clusters. With a Bayesian nonparametric prior, namely the Indian buffet process, our method determines the number of clusters automatically. The utility of our approach is demonstrated through simulation studies and an application to the Genotype-Tissue Expression (GTEx) RNA-seq data. The clustering results reveal patterns among depression-related genes in the human brain that are consistent with biological domain knowledge. The detailed algorithm and some numerical results are available in the online Supplementary Material, http://intlpress.com/site/pub/files/-supp/sii/2024/0017/0002/sii-2024-0017-0002-s001.pdf.
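To illustrate how an Indian buffet process prior can leave the number of clusters open, the following is a minimal sketch of a single draw from the standard one-parameter process. It is not the authors' sampler, which is described in the Supplementary Material; the names `sample_ibp`, `poisson`, and `alpha` are hypothetical and chosen only for this illustration. Each row is an item (for example, a gene), each column is a feature, and an item may own several features, which is what allows clusters to overlap.

```python
import math
import random


def poisson(rate, rng):
    """Poisson draw by Knuth's method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_ibp(n_items, alpha, rng):
    """Draw a binary item-by-feature matrix from the Indian buffet process.

    alpha (an assumed concentration parameter) controls how many features
    appear; the number of columns is random, not fixed in advance.
    """
    counts = []  # counts[k] = number of earlier items that own feature k
    rows = []
    for i in range(1, n_items + 1):
        # Item i takes existing feature k with probability counts[k] / i.
        row = [int(rng.random() < m / i) for m in counts]
        # It then opens Poisson(alpha / i) brand-new features.
        n_new = poisson(alpha / i, rng)
        row.extend([1] * n_new)
        # zip stops at the old features; new ones start with a count of 1.
        counts = [m + z for m, z in zip(counts, row)] + [1] * n_new
        rows.append(row)
    # Pad earlier rows with zeros so every row has one entry per feature.
    n_features = len(counts)
    return [r + [0] * (n_features - len(r)) for r in rows]


if __name__ == "__main__":
    Z = sample_ibp(10, 2.0, random.Random(0))
    for row in Z:
        print("".join(str(v) for v in row))
```

Each printed row shows which features one item belongs to. Rerunning with a different seed or a different `alpha` generally changes the number of columns, which is the sense in which the prior does not require the cluster count to be specified beforehand.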