Knowledge Discovery Group, Institute for Information Technology, National Research Council Canada, 1200 Montréal Road, Ottawa, ON K1A 0R6, Canada.
BMC Bioinformatics. 2012 Apr 4;13:54. doi: 10.1186/1471-2105-13-54.
It is now possible to collect the expression levels of a set of genes from a set of biological samples over a series of time points. Such data have three dimensions, gene-sample-time (GST), and are therefore called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space.
We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster) for 3D short time-series data mining. OPTricluster identifies 3D clusters with coherent evolution in a given 3D dataset by combining a combinatorial approach on the sample dimension with the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profiles. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledons in Brassica napus during seed development, and Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples.
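The combination described above, an order preserving pattern on the time dimension and an enumeration of sample subsets on the sample dimension, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `rank_pattern` and `op_triclusters`, the `min_genes` parameter, and the choice to define a gene's pattern as the plain ranking of its time points (with no handling of ties or noise) are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def rank_pattern(profile):
    """Return the time-point indices ordered by increasing expression.

    Two profiles share an order preserving pattern when this tuple is equal.
    Illustrative assumption: ties and noise are not treated specially.
    """
    return tuple(np.argsort(profile, kind="stable").tolist())


def op_triclusters(data, min_genes=2):
    """Group genes that keep the same time ordering across a sample subset.

    data: array of shape (genes, samples, time points).
    Returns a dict mapping (sample_subset, pattern) to a list of gene indices.
    Every non-empty sample subset is enumerated, so this sketch is only
    practical for a small number of samples.
    """
    n_genes, n_samples, _ = data.shape
    patterns = [
        [rank_pattern(data[g, s]) for s in range(n_samples)]
        for g in range(n_genes)
    ]
    clusters = {}
    for size in range(1, n_samples + 1):
        for subset in combinations(range(n_samples), size):
            groups = defaultdict(list)
            for g in range(n_genes):
                first = patterns[g][subset[0]]
                # Keep the gene only if its ordering is identical in
                # every sample of the subset.
                if all(patterns[g][s] == first for s in subset):
                    groups[first].append(g)
            for pattern, genes in groups.items():
                if len(genes) >= min_genes:
                    clusters[(subset, pattern)] = genes
    return clusters


if __name__ == "__main__":
    # Toy data: 3 genes, 2 samples, 3 time points (values are made up).
    toy = np.array([
        [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
        [[0.5, 1.0, 5.0], [1.0, 3.0, 2.0]],
        [[10.0, 20.0, 30.0], [5.0, 6.0, 7.0]],
    ])
    for (subset, pattern), genes in op_triclusters(toy).items():
        print(subset, pattern, genes)
```

In this toy input, the second gene follows the same ordering as the other two in the first sample but a different ordering in the second sample, so it drops out of the cluster spanning both samples. Differences of that kind between samples are what the combined approach is meant to expose.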
Our analysis showed that OPTricluster generally outperforms other well-known clustering algorithms such as TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in 3D short time-series gene expression data.