Meng Jia, Gao Shou-Jiang, Huang Yufei
Department of ECE, University of Texas at San Antonio, Texas, USA.
Bioinformatics. 2009 Jun 15;25(12):1521-7. doi: 10.1093/bioinformatics/btp235. Epub 2009 Apr 7.
Clustering is a popular data exploration technique widely used in microarray data analysis. For time-series data, however, most conventional clustering algorithms either use one-way clustering methods, which fail to consider the heterogeneity of the temporal domain, or use two-way clustering methods, which do not take into account the time dependency between samples; both produce less informative results. Furthermore, enrichment analysis is often performed independently of, and only after, clustering. This practice can reveal biologically significant clusters but cannot guide the clustering itself toward biologically significant results.
We present a new enrichment constrained framework (ECF) coupled with a time-dependent iterative signature algorithm (TDISA). By applying a sliding time window to incorporate the time dependency of samples and by imposing an enrichment constraint on the clustering parameters, the method allows supervised identification of temporal transcription modules (TTMs) that are biologically meaningful. We also provide rigorous mathematical definitions of the TTM and of the enrichment constraint framework, which serve as objective functions for retrieving biologically significant modules. We applied the enrichment constrained time-dependent iterative signature algorithm (ECTDISA) to human gene expression time-series data of Kaposi's sarcoma-associated herpesvirus (KSHV) infection of human primary endothelial cells; the results not only confirm known biological facts but also reveal new insights into the molecular mechanism of KSHV infection.
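The two ingredients named above, a sliding time window and an enrichment test, can be illustrated with a minimal Python sketch. This is not the authors' TDISA or ECF, whose definitions are given in the full paper and the Matlab code; the function names, the toy expression values, and the use of a hypergeometric upper-tail p-value as the enrichment measure are all assumptions made for illustration.

```python
from math import comb


def sliding_windows(n_timepoints, width):
    """Index lists of `width` consecutive time points, one per start position."""
    return [list(range(s, s + width)) for s in range(n_timepoints - width + 1)]


def window_scores(expr, gene_set, width):
    """Mean expression of `gene_set` inside each window of consecutive time points.

    `expr` maps a gene name to its list of expression values over time
    (hypothetical layout, chosen only for this sketch).
    """
    n_timepoints = len(next(iter(expr.values())))
    scores = []
    for window in sliding_windows(n_timepoints, width):
        values = [expr[g][t] for g in gene_set for t in window]
        scores.append(sum(values) / len(values))
    return scores


def enrichment_pvalue(n_genes, n_annotated, module_size, n_overlap):
    """Hypergeometric upper tail: probability that a random module of
    `module_size` genes contains at least `n_overlap` annotated genes."""
    upper = min(n_annotated, module_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, module_size - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_genes, module_size)


# Toy data (invented): two genes co-activated at time points 2 and 3.
expr = {
    "g1": [0, 0, 2, 2, 0],
    "g2": [0, 0, 2, 2, 0],
    "g3": [1, 1, 1, 1, 1],
}
scores = window_scores(expr, ["g1", "g2"], width=2)
best_start = scores.index(max(scores))  # window where the gene set is most active
pval = enrichment_pvalue(n_genes=10, n_annotated=3, module_size=3, n_overlap=2)
```

In a constrained setting, a candidate module for a given window would be kept only if its enrichment p-value fell below a chosen threshold; how ECTDISA actually couples the constraint to the clustering parameters is specified in the paper, not here.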
Data and Matlab code are available at http://engineering.utsa.edu/~yfhuang/ECTDISA.html.
Supplementary data are available at Bioinformatics online.