Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, Japan.
DNA Res. 2009 Oct;16(5):287-97. doi: 10.1093/dnares/dsp018. Epub 2009 Oct 3.
As a result of remarkable advances in DNA sequencing technology, vast quantities of genomic sequences have been decoded. Homology searches of amino acid sequences, using tools such as BLAST, have become a basic means of assigning functions to genes and proteins when genomic sequences are decoded. Although homology search is clearly a powerful and irreplaceable method, the functions of only 50% or fewer of the genes can be predicted when a novel genome is decoded. A prediction method that is independent of homology search is urgently needed. By analyzing oligonucleotide compositions in genomic sequences, we previously developed a modified Self-Organizing Map, 'BLSOM', that clustered genomic fragments according to phylotype without prior knowledge of phylotype. Using BLSOM for di-, tri- and tetrapeptide compositions, we developed a system that separates (self-organizes) proteins by function. When oligopeptide frequencies were analyzed in proteins previously classified into COGs (clusters of orthologous groups of proteins), BLSOMs faithfully reproduced the COG classifications. This indicates that proteins whose functions are unknown because they lack significant sequence similarity to proteins of known function can be related to such proteins on the basis of similarity in oligopeptide composition. BLSOM was applied to predict the functions of the vast quantities of proteins derived from mixed genomes in environmental samples.
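The oligopeptide composition that serves as the input feature can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `oligopeptide_frequencies`, the plain 20-letter amino acid alphabet, and the choice to skip windows containing non-standard residues are all assumptions made for illustration, and the abstract does not say how the frequencies were normalized or whether amino acids were grouped. With this alphabet the vector has 400 entries for dipeptides, 8,000 for tripeptides and 160,000 for tetrapeptides.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def oligopeptide_frequencies(sequence, k=2):
    """Return the relative frequency of every k-peptide in a protein sequence.

    Hypothetical helper for illustration: overlapping windows of length k
    are counted, and windows containing a non-standard residue are skipped.
    """
    seq = sequence.upper()
    allowed = set(AMINO_ACIDS)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(residue in allowed for residue in window):
            counts[window] += 1

    total = sum(counts.values())
    # Fixed key order so that vectors from different proteins are comparable.
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    if total == 0:
        return {key: 0.0 for key in keys}
    return {key: counts[key] / total for key in keys}


if __name__ == "__main__":
    freqs = oligopeptide_frequencies("ACAC", k=2)
    print(len(freqs), freqs["AC"], freqs["CA"])
```

Each protein then becomes one fixed-length vector, which is the kind of input a self-organizing map such as BLSOM clusters without reference to sequence alignment.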