Abe Takashi, Sugawara Hideaki, Kinouchi Makoto, Kanaya Shigehiko, Ikemura Toshimichi
Center for Information Biology, National Institute of Genetics, The Graduate University for Advanced Studies (Sokendai), Mishima, Shizuoka, Japan.
DNA Res. 2005;12(5):281-90. doi: 10.1093/dnares/dsi015. Epub 2006 Jan 10.
A self-organizing map (SOM) was developed as a novel bioinformatics strategy for phylogenetic classification of sequence fragments obtained from pooled genome samples of uncultured microbes in environmental and clinical samples. This phylogenetic classification was possible without either orthologous sequence sets or sequence alignments. We first constructed SOMs for tetranucleotide frequencies in 210,000 5 kb sequence fragments obtained from 1502 prokaryotes for which at least 10 kb of genomic sequence has been deposited in public DNA databases. The sequences could be classified primarily according to phylogenetic group without any information regarding the species. We then used the SOM method to classify sequence fragments derived from environmental samples of the Sargasso Sea and of an acidophilic biofilm growing in acid mine drainage. The phylogenetic diversity of the environmental sequences was effectively visualized on a single map. Sequences that were derived from a single genome but cloned independently could be reassociated in silico. G + C% has long been used as a fundamental parameter for phylogenetic classification of microbes, but it is apparently too simple a parameter to differentiate a wide variety of known species. Oligonucleotide frequencies can be used to distinguish species because they vary significantly among genomes.
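As an illustration of the input described above, the following minimal Python sketch cuts a sequence into 5 kb fragments and turns each fragment into a 256-dimensional vector of tetranucleotide frequencies, the kind of vector a SOM could be trained on. It is not the authors' implementation: the function names (`split_fragments`, `tetra_freq`, `gc_percent`), the non-overlapping windowing, the skipping of ambiguous bases, and the absence of any strand (reverse-complement) merging are all assumptions made for illustration. The toy sequences at the end show two inputs with identical G + C% but different tetranucleotide usage.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order (the ordering is an arbitrary choice).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def split_fragments(seq, size=5000):
    """Cut seq into non-overlapping fragments of `size` bases (assumed scheme).

    A trailing piece shorter than `size` is dropped.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def tetra_freq(fragment):
    """Return a 256-element list of relative tetranucleotide frequencies."""
    counts = [0] * 256
    frag = fragment.upper()
    for i in range(len(frag) - 3):
        idx = INDEX.get(frag[i:i + 4])
        if idx is not None:  # skip windows containing N or other ambiguity codes
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 256
    return [c / total for c in counts]

def gc_percent(fragment):
    """Return G + C% computed over unambiguous bases only."""
    frag = fragment.upper()
    acgt = sum(frag.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (frag.count("G") + frag.count("C")) / acgt

# Toy sequences (hypothetical, not from the paper): same G + C%,
# different tetranucleotide composition.
a = "ACGT" * 10
b = "AACCGGTT" * 5
same_gc = gc_percent(a) == gc_percent(b)
different_tetra = tetra_freq(a) != tetra_freq(b)
```

In this toy case `same_gc` and `different_tetra` are both true, which mirrors the abstract's point that G + C% alone can fail to separate sequences that oligonucleotide frequencies distinguish.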