Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, Kimura T, Hosouchi T, Matsuno A, Muraki A, Nakazaki N, Naruo K, Okumura S, Shimpo S, Takeuchi C, Wada T, Watanabe A, Yamada M, Yasuda M, Tabata S
Kazusa DNA Research Institute, Chiba, Japan.
DNA Res. 1996 Jun 30;3(3):109-36. doi: 10.1093/dnares/3.3.109.
The sequence of the entire genome of Synechocystis sp. strain PCC6803 has been determined. The confirmed total length of the genome is 3,573,470 bp, including the previously reported sequence of 1,003,450 bp from map position 64% to 92% of the genome. The entire sequence was assembled from the sequences of physical map-based contigs of cosmid clones, together with lambda clones and long PCR products that were used for gap-filling. Accuracy was ensured by analyzing both strands of DNA throughout the genome. The authenticity of the assembled sequence was supported by restriction analysis of long PCR products, which were amplified directly from genomic DNA using the assembled sequence data. To predict potential protein-coding regions, analysis of open reading frames (ORFs), analysis with the GeneMark program, and similarity searches against databases were performed. As a result, a total of 3,168 potential protein genes were assigned on the genome, of which 145 (4.6%) were identical to reported genes, and 1,257 (39.6%) and 340 (10.8%) showed similarity to reported and hypothetical genes, respectively. The remaining 1,426 (45.0%) had no apparent similarity to any genes in databases. Among the potential protein genes assigned, 128 were related to genes participating in photosynthetic reactions. The sequences coding for potential protein genes together occupy 87% of the genome length. With the rRNA and tRNA genes added, the genome therefore has a very compact arrangement of protein- and RNA-coding regions. A notable feature of the gene organization was that 99 ORFs showing similarity to transposase genes, which could be classified into 6 groups, were found spread all over the genome, and at least 26 of them appeared to remain intact. This result implies that rearrangement of the genome occurred frequently during and after establishment of this species.
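The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the category labels and function names are hypothetical, and the mean gene length is a rough estimate derived from the rounded 87% coding fraction, not a value reported by the authors. Recomputing the shares from the raw counts may give values that differ from the published percentages by about 0.1 point for the two similarity categories, presumably because of rounding in the original.

```python
# Figures copied from the abstract; the dictionary keys are illustrative labels.
GENOME_BP = 3_573_470      # confirmed total genome length
PREVIOUS_BP = 1_003_450    # previously reported segment (map position 64% to 92%)
TOTAL_GENES = 3_168        # potential protein genes assigned
CODING_FRACTION = 0.87     # share of the genome occupied by protein-coding sequence

CATEGORIES = {
    "identical_to_reported": 145,
    "similar_to_reported": 1_257,
    "similar_to_hypothetical": 340,
    "no_similarity": 1_426,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize():
    """Recompute the abstract's summary statistics from the raw counts."""
    return {
        # Should equal TOTAL_GENES if the four categories are exhaustive.
        "category_sum": sum(CATEGORIES.values()),
        "category_percent": {
            name: percent(count, TOTAL_GENES)
            for name, count in CATEGORIES.items()
        },
        # Should be close to the 28% span implied by map positions 64% to 92%.
        "previous_segment_percent": percent(PREVIOUS_BP, GENOME_BP),
        # Rough estimate only: coding bases divided by gene count.
        "approx_mean_gene_bp": round(CODING_FRACTION * GENOME_BP / TOTAL_GENES),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

The four categories sum to the stated total of 3,168 genes, and the previously reported segment works out to roughly 28% of the genome, consistent with the stated map interval.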