Sun Jiehuan, Warren Joshua L, Zhao Hongyu
Stat Appl Genet Mol Biol. 2017 Apr 25;16(2):145-158. doi: 10.1515/sagmb-2016-0051.
Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly used to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may perform poorly when genes are highly correlated and when, because of the high dimensionality, many uninformative genes are included in the clustering. In this article, we introduce a novel Bayesian subtype identification method based on gene expression profiles. This method, called BCSub, adopts a semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow a Dirichlet process mixture model, which induces the clustering. Through extensive simulation studies, we show that BCSub performs better than commonly used clustering methods. When applied to two gene expression datasets, our model identifies subtypes that are more clinically relevant than those identified by existing methods.
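The idea of clustering on a few factor scores, rather than on all genes, can be illustrated with a minimal two-stage sketch. This is not the BCSub implementation: BCSub fits the factor model and the Dirichlet process mixture jointly in a single Bayesian model, whereas the sketch below first extracts factor scores with scikit-learn's `FactorAnalysis` and then clusters them with a truncated variational Dirichlet process mixture (`BayesianGaussianMixture`). The function name `cluster_by_factor_scores`, its parameters, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import BayesianGaussianMixture


def cluster_by_factor_scores(expr, n_factors=3, max_clusters=10, seed=0):
    """Two-stage stand-in for the joint model (hypothetical helper).

    expr: array of shape (n_samples, n_genes) holding expression values.
    Returns the factor scores and one cluster label per sample.
    """
    # Stage 1: reduce the genes to a few latent factor scores.
    fa = FactorAnalysis(n_components=n_factors, random_state=seed)
    scores = fa.fit_transform(expr)

    # Stage 2: truncated Dirichlet process mixture on the scores.
    # max_clusters is only an upper bound; unused components get
    # weights close to zero.
    dpmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500,
        random_state=seed,
    )
    labels = dpmm.fit_predict(scores)
    return scores, labels


# Synthetic data (an assumption, not from the paper): two groups of
# samples, 20 correlated informative genes, 180 pure-noise genes.
rng = np.random.default_rng(0)
n_per_group, n_genes, n_informative = 50, 200, 20

# Group-specific means on two latent factors drive the informative genes.
latent = np.vstack([
    rng.normal(loc=-2.0, size=(n_per_group, 2)),
    rng.normal(loc=2.0, size=(n_per_group, 2)),
])
loadings = rng.normal(size=(2, n_informative))

expr = rng.normal(size=(2 * n_per_group, n_genes))
expr[:, :n_informative] += latent @ loadings

scores, labels = cluster_by_factor_scores(expr)
print("scores shape:", scores.shape)
print("clusters used:", np.unique(labels).size)
```

Running the two stages separately is the simplest way to see the dimension-reduction step at work, but it lets errors in the factor scores carry over into the clustering; a joint model of the kind described above is intended to avoid that.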