Okada Yoshifumi, Sahara Takehiko, Mitsubayashi Hikaru, Ohgiya Satoru, Nagashima Tomomasa
Satellite Venture Business Laboratory, Muroran Institute of Technology, 27-1 Mizumoto-cho, Muroran, Hokkaido 050-8585, Japan.
Artif Intell Med. 2005 Sep-Oct;35(1-2):171-83. doi: 10.1016/j.artmed.2005.02.007.
DNA microarray technology has made it possible to determine the expression levels of thousands of genes in parallel under multiple experimental conditions. Genome-wide analyses using DNA microarrays contribute greatly to the exploration of the dynamic state of genetic networks, and further lead to the development of new disease diagnosis technologies. An important step in the analysis of gene expression data is to classify genes with similar expression patterns into the same groups. To this end, hierarchical clustering algorithms have been widely used. The major advantages of hierarchical clustering algorithms are that investigators do not need to specify the number of clusters in advance and that results are presented visually in the form of a dendrogram. However, because traditional hierarchical clustering methods report only the statistical characteristics of the expression data, the resulting clusters are not easy to interpret biologically, and uncovering the hidden biological processes regulated by cluster members is laborious. Interpretation has therefore remained a difficult routine task for experts.
Here, we propose a novel algorithm in which cluster boundaries are determined by referring to functional annotations stored in genome databases.
The algorithm first performs hierarchical clustering of gene expression profiles. Then, the cluster boundaries are determined by the Variance Inflation Factor among the Gene Function Vectors, which represent the distributions of gene functions in each cluster. Our algorithm automatically specifies a cutoff that leads to functionally independent agglomerations of genes on the dendrogram derived from similarities among gene expression patterns. Finally, each cluster is annotated according to the dominant gene functions within that cluster.
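The steps described here can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the Gene Function Vectors are built or how the Variance Inflation Factor selects the cutoff, so several choices below are assumptions. These are the binary `annotations` matrix (genes by function categories), average linkage with correlation distance, the threshold of 10 (a common rule of thumb for VIF), and the rule of keeping the finest cut whose largest VIF stays below that threshold. The function names `gene_function_vectors`, `max_vif`, `choose_cutoff` and `annotate` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def gene_function_vectors(labels, annotations):
    """Row c holds the fraction of genes in cluster c annotated with each category."""
    return np.vstack([annotations[labels == c].mean(axis=0)
                      for c in np.unique(labels)])


def max_vif(gfv):
    """Largest VIF when each cluster's vector is treated as one variable.

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
    inverse of the correlation matrix between the vectors.
    """
    if len(gfv) < 2:
        return 1.0
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(gfv)  # rows (clusters) are the variables
    if not np.all(np.isfinite(corr)):
        return np.inf  # a constant vector has no defined correlation
    if np.linalg.matrix_rank(corr) < len(corr):
        return np.inf  # perfectly collinear vectors
    return float(np.max(np.diag(np.linalg.inv(corr))))


def choose_cutoff(expression, annotations, vif_threshold=10.0, max_clusters=10):
    """Assumed rule: keep the finest cut whose largest VIF is below the threshold."""
    tree = linkage(expression, method="average", metric="correlation")
    best = np.ones(len(expression), dtype=int)
    for k in range(2, max_clusters + 1):
        labels = fcluster(tree, t=k, criterion="maxclust")
        if max_vif(gene_function_vectors(labels, annotations)) < vif_threshold:
            best = labels
    return best


def annotate(labels, annotations, category_names):
    """Label each cluster with its most frequent function category."""
    gfv = gene_function_vectors(labels, annotations)
    return {int(c): category_names[int(np.argmax(row))]
            for c, row in zip(np.unique(labels), gfv)}
```

In this sketch, `expression` is a genes-by-conditions array and `annotations` has one row per gene in the same order. The number of function categories should exceed the number of clusters being compared, otherwise the correlation matrix is singular and every cut is rejected.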
In this paper, we apply our algorithm to two gene expression datasets related to the cell cycle and the cold stress response in the budding yeast Saccharomyces cerevisiae. We show that the algorithm enables us to recognize cluster boundaries characterizing fundamental biological processes such as the Early G1, Late G1, S, G2 and M phases of the cell cycle, and that it also provides novel annotation information that has not been obtained by traditional hierarchical clustering methods. In addition, using formal cluster validity indices, the high validity of our algorithm is verified by comparison with other popular clustering algorithms: K-means, self-organizing map and AutoClass.