Jia Zhenyu, Xu Shizhong
Department of Botany and Plant Sciences, University of California, Riverside, 92521, USA.
Genet Res. 2005 Dec;86(3):193-207. doi: 10.1017/S0016672305007822.
Cluster analyses of gene expression data are usually conducted based on their associations with the phenotype of a particular disease. Many disease traits have a clearly defined binary phenotype (presence or absence), so that genes can be clustered based on the differences of expression levels between the two contrasting phenotypic groups. For example, cluster analysis based on binary phenotype has been successfully used in tumour research. Some complex diseases have phenotypes that vary in a continuous manner and the method developed for a binary trait is not immediately applicable to a continuous trait. However, understanding the role of gene expression in these complex traits is of fundamental importance. Therefore, it is necessary to develop a new statistical method to cluster expressed genes based on their association with a quantitative trait phenotype. We developed a model-based clustering method to classify genes based on their association with a continuous phenotype. We used a linear model to describe the relationship between gene expression and the phenotypic value. The model effects of the linear model (linear regression coefficients) represent the strength of the association. We assumed that the model effects of each gene follow a mixture of several multivariate Gaussian distributions. Parameter estimation and cluster assignment were accomplished via an Expectation-Maximization (EM) algorithm. The method was verified by analysing two simulated datasets, and further demonstrated using real data generated in a microarray experiment for the study of gene expression associated with Alzheimer's disease.
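The approach described above (a linear model per gene, a Gaussian mixture over the regression coefficients, and EM for estimation and cluster assignment) can be illustrated with a small sketch. This is not the authors' implementation. It is a simplified two-stage version under stated assumptions: each gene's slope is first estimated by ordinary least squares, and a Gaussian mixture is then fitted to those slopes by EM. The function names `gene_effects` and `em_gmm`, the simulated data, the quantile-based starting values, and the choice of three clusters are all hypothetical and chosen only for illustration.

```python
import numpy as np

def gene_effects(expr, pheno):
    """Least-squares intercept and slope of each gene's expression on the phenotype.

    expr: (n_genes, n_samples); pheno: (n_samples,). Returns (n_genes, 2).
    """
    X = np.column_stack([np.ones_like(pheno), pheno])   # design matrix (n_samples, 2)
    coef, *_ = np.linalg.lstsq(X, expr.T, rcond=None)   # shape (2, n_genes)
    return coef.T

def em_gmm(B, k, n_iter=200):
    """EM for a k-component Gaussian mixture on the rows of B (n_genes, d)."""
    n, d = B.shape
    # Assumed deterministic start: rows at evenly spaced quantiles of the first effect.
    order = np.argsort(B[:, 0])
    means = B[order[((np.arange(k) + 0.5) * n / k).astype(int)]].copy()
    pooled = np.atleast_2d(np.cov(B, rowvar=False)) + 1e-6 * np.eye(d)
    covs = np.array([pooled.copy() for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability that each gene belongs to each cluster.
        log_r = np.empty((n, k))
        for j in range(k):
            diff = B - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)       # stabilise before exponentiating
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means and covariances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ B) / nk[:, None]
        for j in range(k):
            diff = B - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs, resp

# Simulated example: 300 genes in three groups with negative, null and positive
# association with a continuous phenotype measured on 40 samples.
rng = np.random.default_rng(1)
n_samples = 40
pheno = rng.normal(size=n_samples)
true_cluster = np.repeat([0, 1, 2], 100)
true_slope = np.array([-1.0, 0.0, 1.0])[true_cluster] + rng.normal(scale=0.1, size=300)
expr = true_slope[:, None] * pheno[None, :] + rng.normal(scale=0.5, size=(300, n_samples))

B = gene_effects(expr, pheno)[:, 1:]      # keep only the slope as the "model effect"
weights, means, covs, resp = em_gmm(B, k=3)
labels = resp.argmax(axis=1)              # assign each gene to its most probable cluster
```

Here `resp` holds the posterior cluster probabilities for each gene, so `labels` is a hard assignment derived from them. A joint model, in which the mixture is placed on the effects inside the likelihood of the expression data, would also propagate the uncertainty of each gene's estimated effect, which this two-stage sketch ignores.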