Analysis of common k-mers for whole genome sequences using SSB-tree.

Choi Jeong-Hyeon, Cho Hwan-Gue

ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea.

Genome Inform. 2002;13:30-41.

As sequenced genomes grow larger and sequencing becomes faster, tools are needed to analyze sequences at whole-genome scale. In-memory algorithms such as suffix trees and suffix arrays are not applicable to sets of whole genome sequences, because an individual genome ranges from several million base pairs to hundreds of billions of base pairs. Handling sequence data of this size effectively requires an indexed data structure designed for external memory. In this paper, we introduce SequeX, a workbench for the analysis and visualization of whole genome sequences built on the SSB-tree (Static SB-tree). It consists of two parts: an analysis query subsystem and a visualization subsystem. The query subsystem supports operations such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists understand the structure and features of a whole genome through a sequence viewer, an annotation viewer, a CGR (Chaos Game Representation) viewer, and a k-mer viewer. The system also provides a user-friendly programming interface based on JavaScript for batch processing and for user-specific extensions. SequeX can be used to identify conserved genes or sequences by analyzing common k-mers together with annotation. We analyze the common k-mers of 72 microbial genomes available from Entrez and find that the longest k-mer common to all 72 sequences is an 11-mer, and that only 11 such 11-mers exist. Finally, we note that many common k-mers occur in conserved regions such as CDS, rRNA, and tRNA.
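To make the notion of a common k-mer concrete, here is a minimal in-memory sketch in Python. It is not the SSB-tree approach described in the abstract, which targets external memory; it uses plain set intersection and is only practical for small inputs. The function names (`kmers`, `common_kmers`, `longest_common_k`) and the toy sequences are hypothetical, and the sketch assumes a single strand, ignoring reverse complements.

```python
from typing import Iterable, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_kmers(seqs: Iterable[str], k: int) -> Set[str]:
    """Return the k-mers that occur in every sequence (illustrative helper)."""
    seqs = list(seqs)
    if not seqs:
        return set()
    common = kmers(seqs[0], k)
    for s in seqs[1:]:
        common &= kmers(s, k)
        if not common:  # nothing left to intersect, stop early
            break
    return common


def longest_common_k(seqs: Iterable[str], max_k: int) -> Tuple[int, Set[str]]:
    """Find the largest k <= max_k for which a common k-mer exists.

    Every common k-mer contains a common (k-1)-mer, so the search can
    stop at the first k that yields an empty intersection.
    """
    seqs = list(seqs)
    best_k, best = 0, set()
    for k in range(1, max_k + 1):
        found = common_kmers(seqs, k)
        if not found:
            break
        best_k, best = k, found
    return best_k, best


# Toy input, not real genome data.
toy_genomes = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
k, shared = longest_common_k(toy_genomes, max_k=8)
print(k, sorted(shared))
```

On real genomes the per-sequence k-mer sets would not fit comfortably in memory, which is the motivation the abstract gives for using an external-memory index instead.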
