Pietrokovski S, Hirshon J, Trifonov E N
Department of Polymer Research, Weizmann Institute of Science, Rehovot, Israel.
J Biomol Struct Dyn. 1990 Jun;7(6):1251-68. doi: 10.1080/07391102.1990.10508563.
The frequencies of "words", that is, oligonucleotides within nucleotide sequences, reflect the genetic information contained in the sequence "texts". Nucleotide sequences are characteristically represented by their contrast word vocabularies. Comparing sequences by correlating their contrast vocabularies is shown to reflect well how related or unrelated the sequences are. A single value, the linguistic similarity between the sequences, is suggested as a measure of sequence relatedness. Sequences as short as 1000 bases can be characterized and quantitatively related to other sequences by this technique. The linguistic sequence similarity value is used to analyze taxonomically and functionally diverse nucleotide sequences. The similarity value is shown to be very sensitive to the relatedness of the source species, thus providing a convenient tool for taxonomic classification of species by their sequence vocabularies. Functionally diverse sequences are distinguishable by their linguistic similarity values. This can serve as the basis for a quick screening technique for functional characterization of sequences and for mapping functionally distinct regions in long sequences.
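The idea of correlating contrast vocabularies can be illustrated with a minimal Python sketch. The abstract does not define how the contrast of a word is computed, so the sketch assumes one simple choice: the observed count of each k-letter word minus the count expected from the base composition of the sequence alone. The similarity is then taken as the Pearson correlation between two such vectors. The function names (`word_counts`, `contrast_vocabulary`, `linguistic_similarity`), the word length `k=2`, and the contrast definition are all illustrative assumptions, not the authors' published procedure.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_counts(seq, k):
    """Count every overlapping k-letter word in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def contrast_vocabulary(seq, k=2):
    """Map each possible k-letter word to (observed - expected) count.

    ASSUMPTION: the expected count comes from independent base
    frequencies; the original paper may define contrast differently.
    """
    seq = seq.upper()
    n_words = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = word_counts(seq, k)
    vocab = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        expected = n_words
        for b in word:
            expected *= base_freq[b]
        vocab[word] = counts.get(word, 0) - expected
    return vocab


def linguistic_similarity(seq_a, seq_b, k=2):
    """Pearson correlation between two contrast vocabularies (assumed measure)."""
    va = contrast_vocabulary(seq_a, k)
    vb = contrast_vocabulary(seq_b, k)
    words = sorted(va)
    xs = [va[w] for w in words]
    ys = [vb[w] for w in words]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat vocabulary carries no contrast to correlate
    return cov / (sx * sy)
```

Under these assumptions, comparing a sequence with itself gives a value of 1, and values near 0 or below suggest that the two vocabularies share little contrast structure. For sequences of about 1000 bases or more, a longer word length than `k=2` may be more informative, though the appropriate choice is not stated in the abstract.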