Herrero J, Valencia A, Dopazo J
Bioinformatics, CNIO, Ctra. Majadahonda-Pozuelo, Km 2, Majadahonda, 28220 Madrid, Spain; Protein Design Group, CNB-CSIC, 28049 Madrid, Spain.
Bioinformatics. 2001 Feb;17(2):126-36. doi: 10.1093/bioinformatics/17.2.126.
We describe a new approach to the analysis of gene expression data from DNA array experiments, using an unsupervised neural network. DNA array technologies allow thousands of genes to be monitored rapidly and efficiently. One aim of these studies is to find correlated gene expression patterns, which is usually done by clustering them. The Self-Organising Tree Algorithm (SOTA) (Dopazo, J. and Carazo, J.M. (1997) J. Mol. Evol., 44, 226-233) is a neural network that grows by adopting the topology of a binary tree. The result of the algorithm is a hierarchical clustering obtained with the accuracy and robustness of a neural network.
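To make the idea of a network that grows as a binary tree more concrete, here is a minimal sketch in Python. It is a simplified illustration, not the published SOTA procedure: only the winning leaf is updated, the distance is Euclidean, the leaf with the largest mean distance to its members is the one that splits, and growth stops at a fixed number of leaves. The names `Node`, `grow`, `epochs`, `eta` and `max_leaves` are hypothetical.

```python
import math


class Node:
    """One neuron of the tree; its vector approximates the average profile of its cluster."""

    def __init__(self, vector, parent=None):
        self.vector = list(vector)
        self.parent = parent
        self.children = []


def distance(a, b):
    """Euclidean distance between two equal-length profiles (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(leaves, profile):
    """Return the leaf closest to the profile; min() gives ties to the earlier leaf."""
    return min(leaves, key=lambda leaf: distance(leaf.vector, profile))


def train(leaves, profiles, epochs, eta):
    """Present every profile in each epoch and move the winning leaf towards it."""
    for _ in range(epochs):
        for profile in profiles:
            winner = nearest(leaves, profile)
            winner.vector = [v + eta * (x - v) for v, x in zip(winner.vector, profile)]


def heterogeneity(leaf, leaves, profiles):
    """Mean distance between a leaf and the profiles assigned to it (0.0 if it has none)."""
    members = [p for p in profiles if nearest(leaves, p) is leaf]
    if not members:
        return 0.0
    return sum(distance(leaf.vector, p) for p in members) / len(members)


def grow(profiles, max_leaves, epochs=20, eta=0.1):
    """Grow a binary tree top-down by repeatedly splitting the most heterogeneous leaf."""
    root = Node([sum(col) / len(profiles) for col in zip(*profiles)])
    leaves = [root]
    while len(leaves) < max_leaves:
        worst = max(leaves, key=lambda leaf: heterogeneity(leaf, leaves, profiles))
        if heterogeneity(worst, leaves, profiles) == 0.0:
            break  # every leaf already matches its members exactly
        # Both daughters start as copies of the parent and diverge during training.
        worst.children = [Node(worst.vector, worst), Node(worst.vector, worst)]
        position = leaves.index(worst)
        leaves[position:position + 1] = worst.children
        train(leaves, profiles, epochs, eta)
    return root, leaves


# Toy data: two well-separated groups of two-point expression profiles.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
root, leaves = grow(profiles, max_leaves=2)
labels = [leaves.index(nearest(leaves, p)) for p in profiles]
```

In this sketch `labels` gives the leaf index of each profile, and each leaf's `vector` serves as the average pattern of its cluster. Raising `max_leaves` resolves lower levels of the hierarchy.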
SOTA clustering offers several advantages over classical hierarchical clustering methods. SOTA is a divisive method: clustering proceeds from top to bottom, i.e. the highest hierarchical levels are resolved before the details of the lowest levels. Growth can be stopped at any desired hierarchical level. Moreover, a criterion for stopping the growth of the tree is provided, based on the approximate probability distribution obtained by randomisation of the original data set. This criterion gives statistical support to the definition of clusters. In addition, obtaining average gene expression patterns is a built-in feature of the algorithm: the neurons that define the different hierarchical levels represent the averages of the gene expression patterns contained in the corresponding clusters. Because SOTA runtimes are approximately linear in the number of items to be classified, it is especially suitable for very large data sets. The method is very general and applies to any data, provided that they can be coded as a series of numbers and that a computable measure of similarity between data items is available.
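The abstract does not spell out how the randomisation-based stopping criterion is computed. The sketch below shows one plausible reading, under stated assumptions: the values of each profile are shuffled independently, distances between randomly chosen pairs of shuffled profiles approximate the distribution expected by chance, and a low quantile of that distribution is taken as the threshold below which a cluster is considered tight enough. The function name `randomised_threshold` and the parameters `alpha`, `n_pairs` and `seed` are hypothetical, and Euclidean distance is an assumed choice.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length profiles (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def randomised_threshold(profiles, alpha=0.05, n_pairs=1000, seed=0):
    """Estimate a distance threshold from a randomised copy of the data.

    Returns the alpha-quantile of distances between randomly paired,
    independently shuffled profiles. Requires at least two profiles.
    """
    rng = random.Random(seed)

    # Assumption: randomise by permuting the values within each profile.
    shuffled = []
    for profile in profiles:
        values = list(profile)
        rng.shuffle(values)
        shuffled.append(values)

    # Sample pairs of distinct shuffled profiles and record their distances.
    distances = []
    for _ in range(n_pairs):
        a, b = rng.sample(shuffled, 2)
        distances.append(distance(a, b))

    distances.sort()
    index = min(len(distances) - 1, int(alpha * len(distances)))
    return distances[index]


# Toy expression profiles (four genes, four conditions); the values are illustrative only.
profiles = [
    [0.1, 1.2, 2.5, 0.3],
    [2.0, 0.2, 0.1, 1.1],
    [0.5, 0.4, 3.0, 2.2],
    [1.5, 1.4, 0.2, 0.9],
]
strict = randomised_threshold(profiles, alpha=0.05)
loose = randomised_threshold(profiles, alpha=0.5)
```

A tree-growing loop could stop splitting a cluster once its internal variability falls below such a threshold. A smaller `alpha` gives a stricter threshold and therefore a deeper tree.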
A server running the program can be found at: http://bioinfo.cnio.es/sotarray.