Wang Junbai, Delabie Jan, Aasheim Hans, Smeland Erlend, Myklebost Ola
Department of Tumor Biology, Norwegian Radium Hospital, N0310 Oslo, Norway.
BMC Bioinformatics. 2002 Nov 24;3:36. doi: 10.1186/1471-2105-3-36.
Methods to evaluate and analyze the massive data generated by series of microarray experiments are essential for revealing hidden patterns of gene expression. Because microarray gene expression profiles are complex and high-dimensional, reducing the dimensionality of the raw expression data and selecting the features needed for tasks such as classifying disease samples remain a challenge. To address this problem, we propose a two-level analysis. First, a self-organizing map (SOM) is used. The SOM is a vector quantization method that simplifies and reduces the dimensionality of the original measurements and visualizes each individual tumor sample in a SOM component plane. Next, hierarchical clustering and K-means clustering are used to identify patterns of gene expression useful for classification of samples.
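The two-level idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that genes are the input vectors and samples are the vector components, so that a component plane shows one sample's value at every map unit. The function names (`train_som`, `component_plane`, `kmeans`), the 4 x 4 map size, the learning schedule, and the toy data are all hypothetical choices made for illustration, and the hierarchical clustering step is omitted for brevity. Only the Python standard library is used.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(centres, x):
    """Index of the vector in `centres` closest to x."""
    return min(range(len(centres)), key=lambda k: sq_dist(centres[k], x))

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Level 1: fit a rows x cols SOM and return one prototype per map unit.

    Prototypes are stored row-major, so unit (r, c) is at index r * cols + c.
    """
    rng = random.Random(seed)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    units = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each prototype from a randomly chosen input vector.
    protos = [list(rng.choice(data)) for _ in units]
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            br, bc = units[nearest(protos, x)]       # best-matching unit
            for k, (r, c) in enumerate(units):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = lr * math.exp(-d2 / (2 * sigma ** 2))
                protos[k] = [p + h * (xi - p) for p, xi in zip(protos[k], x)]
            step += 1
    return protos

def component_plane(protos, rows, cols, sample):
    """Grid of one component (one sample) across all map units."""
    return [[protos[r * cols + c][sample] for c in range(cols)]
            for r in range(rows)]

def kmeans(points, k, iters=100, seed=0):
    """Level 2: plain K-means applied to the SOM prototypes."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [nearest(centres, p) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Made-up toy data (not from the paper): 60 "genes" x 4 "samples",
# with two opposite expression patterns plus Gaussian noise.
rng = random.Random(1)
genes = []
for g in range(60):
    base = [2.0, 2.0, -2.0, -2.0] if g < 30 else [-2.0, -2.0, 2.0, 2.0]
    genes.append([b + rng.gauss(0, 0.3) for b in base])

rows, cols = 4, 4
protos = train_som(genes, rows, cols)                    # level 1
plane0 = component_plane(protos, rows, cols, sample=0)   # "expression display"
labels, centres = kmeans(protos, k=2)                    # level 2
```

Clustering the 16 prototypes rather than the 60 raw gene vectors is the point of the intermediate step: the second-level method works on a smaller, smoothed summary of the data, and each sample can be inspected as a small grid.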
We tested the two-level analysis on public data from diffuse large B-cell lymphomas. Without supervision, the analysis readily distinguished four major gene expression patterns: germinal center-related, proliferation, inflammatory, and plasma cell differentiation-related. The first three matched the patterns described in the original publication, which used supervised clustering analysis, whereas the fourth was novel.
Our study shows that by using SOM as an intermediate step to analyze genome-wide gene expression data, the gene expression patterns can more easily be revealed. The "expression display" by the SOM component plane summarises the complicated data in a way that allows the clinician to evaluate the classification options rather than giving a fixed diagnosis.