Abe Takashi, Sugawara Hideaki, Kanaya Shigehiko, Kinouchi Makoto, Ikemura Toshimichi
Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, and The Graduate University for Advanced Studies (Sokendai), Mishima, Shizuoka 411-8540, Japan.
Gene. 2006 Jan 3;365:27-34. doi: 10.1016/j.gene.2005.09.040. Epub 2005 Dec 20.
Novel tools are needed for comprehensive comparison of interspecies characteristics across the massive amounts of genomic sequence data currently available. The Self-Organizing Map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. For genome informatics, we modified the conventional SOM on the basis of batch-learning SOM, making the learning process and the resulting map independent of the order of data input. We generated SOMs for tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments from 38 eukaryotes for which almost complete genome sequences are available. The SOM recognized species-specific characteristics (key combinations of oligonucleotide frequencies) in the genomic sequences, permitting species-specific classification of the sequences without any information regarding the species. We also generated a SOM for tetranucleotide frequencies in 1-kb sequence fragments from the human genome and found that sequences from four functional categories (5' and 3' UTRs, CDSs and introns) were classified primarily according to category. Because its classification and visualization power is very high, SOM is an efficient and powerful tool for extracting a wide range of genome information.
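The pipeline described above (fragment a genome, compute oligonucleotide frequencies per fragment, train a SOM by batch learning) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic batch SOM with a Gaussian neighborhood, and the function names (`kmer_frequencies`, `fragments`, `batch_som`, `best_unit`), the neighborhood schedule, and the initialization along the line between the two most distant input vectors are all assumptions made for illustration. The toy sequences are synthetic, not real genomic data.

```python
import math
import random
from itertools import product

def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4**k k-mers over ACGT, in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def fragments(seq, size):
    """Non-overlapping fragments of exactly `size` bases; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: dist2(weights[j], x))

def batch_som(data, rows, cols, epochs=10):
    """Generic batch SOM (an assumption, not the paper's exact procedure)."""
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    n, dim = len(grid), len(data[0])
    # Assumed initialization: spread nodes along the line joining the two most
    # distant input vectors, chosen in a way that ignores input order.
    _, start, end = max((dist2(a, b), min(a, b), max(a, b)) for a in data for b in data)
    weights = [[s + (e - s) * j / max(n - 1, 1) for s, e in zip(start, end)]
               for j in range(n)]
    for epoch in range(epochs):
        sigma = max(rows, cols) / 2 * (1 - epoch / epochs) + 0.5  # shrinking radius
        # Batch step 1: assign every vector to its best unit using the old weights.
        bmus = [best_unit(weights, x) for x in data]
        # Batch step 2: each node becomes a neighborhood-weighted mean of all vectors.
        updated = []
        for r, c in grid:
            h = [math.exp(-((r - grid[b][0]) ** 2 + (c - grid[b][1]) ** 2)
                          / (2 * sigma ** 2)) for b in bmus]
            denom = math.fsum(h)
            updated.append([math.fsum(hi * x[d] for hi, x in zip(h, data)) / denom
                            for d in range(dim)])
        weights = updated
    return weights

if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic "genomes" with different base composition (toy data only).
    genomes = {
        "AT-rich": "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=2000)),
        "GC-rich": "".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=2000)),
    }
    labels, vectors = [], []
    for name, seq in genomes.items():
        for frag in fragments(seq, 200):
            labels.append(name)
            vectors.append(kmer_frequencies(frag, k=3))
    som = batch_som(vectors, rows=3, cols=3)
    for name, vec in zip(labels, vectors):
        print(name, best_unit(som, vec))  # species label is used only for display
```

In this sketch the weights are updated only after all vectors have been assigned to their best units, so no vector influences the map before any other. Summing with `math.fsum` keeps the result free of order-dependent rounding, which is one simple way to make the trained map independent of input order. The species labels play no part in training; they are attached afterwards to inspect where each fragment lands.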