Abe Takashi, Kanaya Shigehiko, Kinouchi Makoto, Ichiba Yuta, Kozuki Tokio, Ikemura Toshimichi
Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan.
Genome Inform. 2002;13:12-20.
As the number of available genome sequences grows, novel tools are needed for comprehensive analysis of species-specific sequence characteristics across a wide variety of genomes. We used an unsupervised neural network algorithm, Kohonen's self-organizing map (SOM), to analyze di- and trinucleotide frequencies in 9 eukaryotic genomes with known sequences (1.2 Gb in total): S. cerevisiae, S. pombe, C. elegans, A. thaliana, D. melanogaster, Fugu, and rice, as well as P. falciparum chromosomes 2 and 3 and human chromosomes 14, 20, 21, and 22, which have been almost completely sequenced. Genomic sequences were divided into windows of different sizes, and each window was encoded as 16- and 64-dimensional vectors giving the relative frequencies of di- and trinucleotides, respectively. In an analysis of a total of 120,000 nonoverlapping 10-kb sequences and of overlapping 100-kb sequences taken with a moving step size of 10 kb, all derived from the 1.2 Gb of genomic sequence, the SOMs clearly separated most sequences by species. In most of the 120,000 10-kb sequences, the unsupervised algorithm recognized the species-specific characteristics (key combinations of oligonucleotide frequencies) that serve as signatures of each genome. Because their classification power is very high, SOMs can provide fundamental bioinformatic strategies for extracting a wide range of genomic information that could not otherwise be obtained.
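The encoding and clustering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 10 x 10 map size, the learning-rate and neighborhood schedules, and the decision to skip k-mers containing ambiguous bases are all assumptions made for illustration, and the SOM is a generic online Kohonen update written with NumPy.

```python
from itertools import product
import numpy as np

def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers (16 values for k=2, 64 for k=3)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # assumption: k-mers with N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def windows(seq, size, step):
    """Yield windows of length `size`; step == size gives nonoverlapping windows."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]

def best_matching_unit(weights, x):
    """Grid coordinates (row, col) of the node whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Generic online SOM; all hyperparameters here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # grid[r, c] holds the coordinates (r, c) of each map node
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1 - frac)                # learning rate decays linearly
            sigma = sigma0 * (1 - frac) + 0.5    # neighborhood radius shrinks
            bmu = best_matching_unit(weights, x)
            grid_d2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))  # Gaussian neighborhood
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights
```

Under these assumptions, a genome string `genome` would be encoded with `np.array([kmer_frequencies(w, 3) for w in windows(genome, 10_000, 10_000)])` for nonoverlapping 10-kb windows, or with `windows(genome, 100_000, 10_000)` for overlapping 100-kb windows moved in 10-kb steps. After training, mapping each window to its best-matching node and labeling nodes by the species of the windows they attract would show whether the species separate on the map.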