Ghouila Amel, Yahia Sadok Ben, Malouche Dhafer, Jmel Haifa, Laouini Dhafer, Guerfali Fatma Z, Abdelhak Sonia
Research Unit on Molecular Investigation of Genetic Orphan Diseases, Institut Pasteur de Tunis, 13 Place Pasteur, BP 74, Tunis Belvédère 1002, Tunisia.
Infect Genet Evol. 2009 May;9(3):328-36. doi: 10.1016/j.meegid.2008.09.009. Epub 2008 Oct 17.
The production of increasingly reliable and accessible gene expression data has stimulated the development of computational tools to interpret and organize such data efficiently. Clustering techniques are widely recognized as useful exploratory tools for gene expression data analysis. Genes that show similar expression patterns over a wide range of experimental conditions can be clustered together, on the hypothesis that genes belonging to the same cluster are coregulated and involved in related functions. Nevertheless, clustering algorithms still have limitations, particularly in estimating the number of clusters and in interpreting hierarchical dendrograms, both of which may significantly influence the output of the analysis. We propose here a multilevel clustering algorithm based on self-organizing maps (SOM), named Multi-SOM. Through the use of clustering validity indices, Multi-SOM overcomes the problem of estimating the number of clusters. To validate the proposed algorithm, we first tested it on supervised training data sets and evaluated the results by counting the misclassified samples. We then used Multi-SOM to analyze macrophage gene expression data generated in vitro from the blood of the same individual infected with five different pathogens. This analysis identified sets of genes that are tightly coregulated across the different pathogens. Gene Ontology tools were then used to assess the biological significance of the clustering, and showed that the resulting clusters are coherent and biologically significant.
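The abstract does not spell out how the levels or the validity index work, so the following is only a minimal sketch of the general idea, under stated assumptions: a first SOM is trained on the data, each further level trains a smaller SOM on the prototypes of the previous level, and the level with the lowest Davies-Bouldin index is kept. The choice of the Davies-Bouldin index, the grid sizes, the training schedule, and every function name (`train_som`, `assign`, `davies_bouldin`, `multi_som`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM online; returns (rows*cols, dim) prototypes."""
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen samples (assumption).
    protos = data[rng.integers(0, len(data), rows * cols)].astype(float).copy()
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)    # shrinking neighbourhood radius
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((protos - x) ** 2).sum(axis=1))   # best-matching unit
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)         # grid distance to BMU
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        protos += lr * h[:, None] * (x - protos)
    return protos


def assign(points, protos):
    """Index of the nearest prototype for every point."""
    d = ((points[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def davies_bouldin(data, labels):
    """Davies-Bouldin index (lower is better); inf for fewer than two clusters."""
    ids = np.unique(labels)
    if len(ids) < 2:
        return np.inf
    cents = np.array([data[labels == k].mean(axis=0) for k in ids])
    scatter = np.array([np.linalg.norm(data[labels == k] - c, axis=1).mean()
                        for k, c in zip(ids, cents)])
    worst = []
    for i in range(len(ids)):
        ratios = [(scatter[i] + scatter[j])
                  / (np.linalg.norm(cents[i] - cents[j]) + 1e-12)
                  for j in range(len(ids)) if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))


def multi_som(data, start=4, seed=0):
    """Stack SOMs of decreasing size; keep the level with the best validity index."""
    protos = train_som(data, start, start, seed=seed)
    labels = assign(data, protos)                  # level 1: sample -> map unit
    best_labels, best_score = labels, davies_bouldin(data, labels)
    for size in range(start - 1, 1, -1):           # e.g. 3x3, then 2x2
        upper = train_som(protos, size, size, seed=seed)
        labels = assign(protos, upper)[labels]     # chain sample -> unit -> upper unit
        protos = upper
        score = davies_bouldin(data, labels)
        if score < best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-in for expression profiles: three well-separated groups.
    demo = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
    demo_labels, demo_score = multi_som(demo)
    print(len(np.unique(demo_labels)), "clusters, Davies-Bouldin =", round(demo_score, 3))
```

In this sketch the number of clusters is never supplied by the user; it falls out of whichever level scores best. On real expression data, rows would be genes and columns experimental conditions, and another validity index (for instance the Dunn index) could be swapped in for `davies_bouldin`.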