Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-Gu, Seoul, 151-747, Republic of Korea.
Department of Computer Science and Engineering.
Bioinformatics. 2017 Dec 1;33(23):3827-3835. doi: 10.1093/bioinformatics/btw780.
Identifying biologically meaningful gene expression patterns from time series gene expression data is important for understanding the underlying biological mechanisms. To identify gene sets that are significantly perturbed between different phenotypes, analysis of time series transcriptome data must consider both the time and sample dimensions. The analysis therefore searches for gene sets that exhibit similar or different expression patterns between two or more sample conditions, which makes the data three-dimensional, i.e. gene-time-condition. The computational complexity of analyzing such data is very high, even compared to two-dimensional biclustering, which is already NP-hard. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression patterns in two sample conditions.
We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing of clusters to detect similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high-throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved clustering quality compared to existing triclustering tools, and it was the only tool that successfully detected clusters with differential expression patterns across conditions.
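To make the idea of "time-condition concatenated vectors" concrete, the following is a minimal sketch and not the TimesVector implementation. It assumes a toy data set, mean-centering, cosine similarity as the comparison measure, and an arbitrary threshold of 0.9 for calling a pattern "similar" across conditions; all names (`expression`, `concatenate`, `pattern_type`) are hypothetical.

```python
import math

# Hypothetical toy data: expression[gene][condition] is a list over time points.
expression = {
    "geneA": {"control": [1.0, 2.0, 3.0], "treated": [1.0, 2.0, 3.0]},
    "geneB": {"control": [2.0, 4.0, 6.0], "treated": [2.1, 4.0, 5.9]},
    "geneC": {"control": [1.0, 2.0, 3.0], "treated": [3.0, 2.0, 1.0]},
}
CONDITIONS = ["control", "treated"]


def concatenate(profile, conditions):
    """Join the per-condition time series of one gene into a single vector."""
    vector = []
    for condition in conditions:
        vector.extend(profile[condition])
    return vector


def center(vector):
    """Subtract the mean so that only the shape of the profile matters."""
    mean = sum(vector) / len(vector)
    return [x - mean for x in vector]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gene_similarity(gene_1, gene_2):
    """Compare two genes on their centered time-condition concatenated vectors."""
    u = center(concatenate(expression[gene_1], CONDITIONS))
    v = center(concatenate(expression[gene_2], CONDITIONS))
    return cosine(u, v)


def pattern_type(gene, threshold=0.9):
    """Label a gene's pattern as 'similar' or 'distinct' across two conditions.

    The threshold is an assumed illustrative value, not one from the paper.
    """
    first, second = (center(expression[gene][c]) for c in CONDITIONS)
    return "similar" if cosine(first, second) >= threshold else "distinct"


if __name__ == "__main__":
    print(gene_similarity("geneA", "geneB"))  # close to 1: same shape in both conditions
    print(gene_similarity("geneA", "geneC"))  # near 0: geneC reverses under treatment
    for gene in expression:
        print(gene, pattern_type(gene))
```

In this toy example, geneA and geneB would fall into the same group because their concatenated vectors point in nearly the same direction, while geneC differs only in the treated half of its vector, which is the kind of condition-specific pattern a gene-time-condition analysis is meant to expose.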
The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/.
Supplementary data are available at Bioinformatics online.