Koestler Devin C, Marsit Carmen J, Christensen Brock C, Kelsey Karl T, Houseman E Andres
Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS 66160, USA.
Department of Community and Family Medicine, Section for Biostatistics and Epidemiology, Dartmouth Medical School, Hanover, NH 03756, USA; Department of Pharmacology and Toxicology, Dartmouth College, Hanover, NH 03756, USA.
Transl Cancer Res. 2014;3(3):217-232. doi: 10.3978/j.issn.2218-676X.2014.06.04.
Longitudinally collected gene expression data provide an opportunity to investigate the dynamic behavior of gene expression and are crucial for establishing causal links between molecular-level changes and disease development and progression. Clustering subjects on the basis of time-course expression data may improve our understanding of the temporal expression patterns that result in disease phenotypes. Although numerous methods exist for clustering subjects using gene expression data, most are not suitable when expression measurements are collected repeatedly over a time course.
We present a modified version of the recursively partitioned mixture model (RPMM) for clustering subjects based on longitudinally collected gene expression data. In the proposed time-course RPMM (TC-RPMM), subjects are clustered on the basis of their temporal profiles of gene expression using a mixture of mixed effects models framework. This framework captures changes in gene expression over time and models the autocorrelation between repeated gene expression measurements for the same subject. We assessed the performance of TC-RPMM using extensive simulation studies and a dataset from a multi-center research study of inflammation and response to injury (www.gluegrant.org), which consisted of time-course gene expression data for 140 subjects.
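For readers unfamiliar with the term, a mixture of mixed effects models can be written in a generic form as below. This is an illustrative sketch only: the notation, the random-intercept structure, and the assumption that genes are conditionally independent given the class are ours and are not taken from the abstract, which does not state the exact specification used in TC-RPMM.

```latex
% Generic sketch (assumed notation): subject i, gene j, latent class C_i, time t.
% x(t) is a vector of time basis functions, beta_{jk} the class-specific trajectory.
\begin{align}
  y_{ij}(t) \mid (C_i = k) &= \mathbf{x}(t)^{\top}\boldsymbol{\beta}_{jk} + b_{ij} + \varepsilon_{ij}(t), \\
  b_{ij} \sim N(0, \tau_{jk}^{2}), &\qquad \varepsilon_{ij}(t) \sim N(0, \sigma_{jk}^{2}), \\
  f(\mathbf{y}_i) &= \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} f_{jk}(\mathbf{y}_{ij}), \qquad \sum_{k=1}^{K} \pi_k = 1 .
\end{align}
```

Under this assumed form, the subject-specific random intercept $b_{ij}$ is what induces correlation between repeated measurements on the same subject: any two time points share the correlation $\tau_{jk}^{2} / (\tau_{jk}^{2} + \sigma_{jk}^{2})$.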
Our simulation studies encompassed several scenarios and were aimed at assessing the ability of TC-RPMM to correctly recover true class memberships when the expression trajectories characterizing those classes differed. Overall, the simulations revealed favorable performance of TC-RPMM compared to competing approaches; however, clustering performance was highly dependent on the proportion of class-discriminating genes used in the clustering analysis. When applied to real epidemiologic data with repeated, longitudinal gene expression measurements, TC-RPMM identified clusters that had strong biological and clinical significance.
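Recovery of true class memberships in a simulation is commonly summarized with a label-invariant agreement score. The sketch below computes the adjusted Rand index in plain Python. The abstract does not say which metric the authors used, so the choice of metric and the function name `adjusted_rand_index` are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected agreement between two partitions of the same subjects.

    Illustrative only: the abstract does not specify the metric used.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # fewer than two subjects: the partitions agree trivially

    # Contingency counts: subjects per (true class, predicted cluster) cell.
    cell_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of subject pairs placed together under each view.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    # Cluster names are arbitrary, so a relabeled but identical partition scores 1.
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))
    # A partition that splits every true class scores below 0.
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))
```

A score of 1 indicates perfect recovery regardless of how clusters are numbered, and values near 0 indicate agreement no better than chance.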
Methods for clustering subjects based on temporal gene expression profiles are a high priority for molecular biology and bioinformatics research. Along these lines, the proposed TC-RPMM represents a promising new approach for analyzing time-course gene expression data.