Karlin S, Mrázek J, Campbell A M
Department of Mathematics, Stanford University, CA 94305-2125, USA.
Nucleic Acids Res. 1996 Nov 1;24(21):4263-72. doi: 10.1093/nar/24.21.4263.
The complete Haemophilus influenzae genome (1.83 Mb, Rd strain) provides opportunities for characterizing global genomic inhomogeneities and for detecting important sequence signals. To this end, new methods for identifying frequent words (oligonucleotides and/or peptides) and their distributions are applied to the H. influenzae genome, with some comparisons and contrasts to the frequent words of other bacterial genomes. Three major classes of frequent oligonucleotides stand out: (i) oligos related to the familiar uptake signal sequences (USSs), AAGTGCGGT (USS+) and its inverted complement (USS-); (ii) multiple tetranucleotide iterations; and (iii) intergenic dyad sequences (IDSs), found as AAGCCCACCCTAC and its dyad form. USS+ and USS- occur in almost equal counts, are remarkably evenly spaced around the genome, and appear predominantly in the same reading frame of protein coding domains (USS+ translated to Ser-Ala-Val, USS- translated to Thr-Ala-Leu). These observations suggest that USSs contribute to global genomic functions, for example in replication and/or repair processes, as membrane attachment sites, or as sequences helping to pack DNA. The long tetranucleotide iterations are virtually unique to H. influenzae (i.e., unknown in other prokaryotes); through polymerase slippage during replication and/or homologous recombination, they may produce subpopulations expressing alternative proteins. The 13 bp frequent IDS words, invariably intergenic, occur mostly in clusters and provide potential for complex secondary structures, suggesting that these sequences may be important signals for regulating the activity of their flanking genes.
The frequent oligopeptides of H. influenzae are principally of two kinds: those induced by oligonucleotide frequent words (USSs, tetranucleotide iterations), and those associated with ATP or GTP binding sites, which are generally composed of three motifs: the A-box, which contributes to delineating the binding pocket; the B-box, which functions in hydrolysis; and the C-box, whose function is unknown. The A-box occurs fairly universally in prokaryotes and eukaryotes. The B- and C-motifs appear to be specialized to various functional groups (e.g., transport, recombination, chaperone activity). Other putative motifs correspond to homologs of Escherichia coli motifs, for example those associated with proteins of transcriptional processing, aminoacyl-tRNA synthetases and proteins functioning in electron transfer.
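
As a rough illustration of the word-counting idea in the abstract, the following minimal Python sketch counts overlapping oligonucleotide words, locates USS+ (AAGTGCGGT) and its inverted complement, and checks the in-frame translations quoted above (Ser-Ala-Val and Thr-Ala-Leu). It is not the authors' method: their statistical criteria for calling a word "frequent" are not given here, so this shows raw counting only. The function names and the toy sequence `TOY` are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter

# Complement map for an unambiguous DNA alphabet (assumption: upper-case A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Return the inverted complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_words(genome, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def find_positions(genome, word):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    positions = []
    start = genome.find(word)
    while start != -1:
        positions.append(start)
        start = genome.find(word, start + 1)
    return positions


def translate(seq):
    """Translate complete codons from the start of seq; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


USS_PLUS = "AAGTGCGGT"
USS_MINUS = reverse_complement(USS_PLUS)

# Hypothetical toy sequence (not real genome data): two USS+ copies and one USS-.
TOY = "CCC" + USS_PLUS + "TTTT" + USS_MINUS + "GG" + USS_PLUS

if __name__ == "__main__":
    counts = count_words(TOY, len(USS_PLUS))
    print("USS+ count:", counts[USS_PLUS], "at", find_positions(TOY, USS_PLUS))
    print("USS- count:", counts[USS_MINUS], "at", find_positions(TOY, USS_MINUS))
    # USS+ read from its second base: AGT GCG GT?; any fourth base completes a Val codon.
    print("USS+ frame:", translate(USS_PLUS[1:] + "A"))
    print("USS- frame:", translate(USS_MINUS))
```

On a real genome, the gaps between consecutive values returned by `find_positions` would give a first look at how evenly the USS copies are spaced, though judging evenness properly needs a statistical model that this sketch does not attempt.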