Bai Yu, Iwasaki Yuki, Kanaya Shigehiko, Zhao Yue, Ikemura Toshimichi
Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0192, Japan.
Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
Biomed Res Int. 2014;2014:765648. doi: 10.1155/2014/765648. Epub 2014 Apr 3.
With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analysis of big sequence data. The Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition, on a single map. By modifying the conventional SOM, we previously developed the Batch-Learning SOM (BLSOM), which classifies sequence fragments according to species solely on the basis of oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM as applied to the characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes, and then the compositions in the human and mouse genomes, in order to investigate an efficient method for detecting differences between closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," as well as specific regions enriched in transcription-factor-binding sequences. Because its classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequence (i.e., big sequence data).
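As an illustration of the input described in the abstract, the following minimal sketch shows one way pentanucleotide composition could be computed for fixed-length sequence windows. It is not the authors' implementation. The function names `kmer_frequencies` and `composition_vectors`, the use of non-overlapping windows, the skipping of k-mers that contain ambiguous bases, and the omission of any reverse-complement merging are all assumptions made for this example. Each window yields a 1024-dimensional frequency vector (4^5 pentanucleotides), which is the kind of high-dimensional data a SOM would then cluster.

```python
from itertools import product


def kmer_frequencies(window, k=5):
    """Return relative frequencies of all 4**k DNA k-mers in `window`.

    Hypothetical helper: k-mers containing anything other than A, C, G, T
    (for example N) are skipped and do not count toward the total.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    window = window.upper()
    total = 0
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    # Order follows itertools.product over "ACGT": AAAAA, AAAAC, AAAAG, ...
    return [counts[km] / total for km in kmers]


def composition_vectors(sequence, window_size=100_000, k=5):
    """Cut `sequence` into non-overlapping windows and featurize each one.

    Assumption for this sketch: a trailing fragment shorter than
    `window_size` is discarded.
    """
    vectors = []
    for start in range(0, len(sequence) - window_size + 1, window_size):
        window = sequence[start:start + window_size]
        vectors.append(kmer_frequencies(window, k))
    return vectors
```

With the defaults, `composition_vectors(genome_string)` returns one 1024-element vector per complete 100 kb window. Smaller values of `window_size` and `k` are convenient for checking the function on short toy sequences.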