Kutsenko Alexey S, Gizatullin Rinat Z, Al-Amin Ali N, Wang Fuli, Kvasha Sergei M, Podowski Raf M, Matushkin Yuri G, Gyanchandani Anita, Muravenko Olga V, Levitsky Viktor G, Kolchanov Nikolay A, Protopopov Alexei I, Kashuba Vladimir I, Kisselev Lev L, Wasserman Wyeth, Wahlestedt Claes, Zabarovsky Eugene R
Center for Genomics and Bioinformatics, Karolinska Institute, 171 77 Stockholm, Sweden.
Nucleic Acids Res. 2002 Jul 15;30(14):3163-70. doi: 10.1093/nar/gkf428.
A set of 22 551 unique human NotI flanking sequences (16.2 Mb) was generated. More than 40% of the set had regions with significant similarity to known proteins and expressed sequences. The data demonstrate that regions flanking NotI sites are less likely to form nucleosomes efficiently and resemble promoter regions. The draft human genome sequence contained 55.7% of the NotI flanking sequences, Celera's database contained matches to 57.2% of the clones, and all public databases (including non-human and previously sequenced NotI flanks) matched 89.2% of the NotI flanking sequences (identity ≥90% over at least 50 bp, data from December 2001). The data suggest that the shotgun sequencing approach used to generate the draft human genome sequence resulted in a bias against cloning and sequencing of NotI flanks. A rough estimate (based primarily on chromosomes 21 and 22) is that the human genome contains 15 000-20 000 NotI sites, of which 6000-9000 are unmethylated in any particular cell. The results of the study suggest that the existing tools for computational determination of CpG islands fail to identify a significant fraction of functional CpG islands, and that unmethylated DNA stretches with a high frequency of CpG dinucleotides can be found even in regions with low CG content.
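To make the closing point about CpG island detection concrete, the following is a minimal sketch of how a conventional island test can miss a CpG-dense stretch in a region with low G+C content. It is not the pipeline used in the study. The thresholds (length at least 200 bp, G+C fraction at least 0.5, observed/expected CpG ratio at least 0.6) are the widely used Gardiner-Garden and Frommer criteria and are an assumption here, since the abstract does not say which tools or cutoffs were evaluated. All function names are hypothetical.

```python
import re

NOTI_SITE = "GCGGCCGC"  # NotI recognition sequence (palindromic 8-mer)


def find_noti_sites(seq):
    """Return 0-based start positions of every NotI site in seq."""
    # A lookahead is used so that overlapping matches would also be reported.
    return [m.start() for m in re.finditer(f"(?={NOTI_SITE})", seq.upper())]


def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for seq."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return (0.0 if n == 0 else (c + g) / n), 0.0
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return (c + g) / n, (cpg * n) / (c * g)


def passes_island_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, G+C and observed/expected thresholds."""
    gc, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    # Toy sequence: CpG-dense (one CpG every 5 bp) but only 40% G+C.
    toy = "AATCG" * 40
    gc, obs_exp = cpg_stats(toy)
    print(f"G+C = {gc:.2f}, CpG obs/exp = {obs_exp:.2f}")
    print("passes conventional criteria:", passes_island_criteria(toy))
```

In this toy case the observed/expected CpG ratio is far above the cutoff, yet the sequence fails the test solely on G+C fraction, which is the kind of region the abstract suggests existing tools overlook.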