Pavlović-Lazetić Gordana M, Mitić Nenad S, Beljanski Milos V
University of Belgrade, Studentski trg 16, Belgrade, Serbia.
Comput Methods Programs Biomed. 2009 Mar;93(3):241-56. doi: 10.1016/j.cmpb.2008.10.014. Epub 2008 Dec 19.
The paper presents a novel n-gram-based method for analyzing bacterial genome segments known as genomic islands (GIs). Identifying GIs in bacterial genomes is an important task, since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. To characterize GIs and distinguish them from the rest of the genome, a binary classification of islands based on n-gram frequency distributions is performed. It consists of testing the agreement of the islands' n-gram frequency distributions with those of the complete genome and of the backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for a Zipf-like analysis suggesting that some n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), yielding two groups of O-islands with different n-gram characteristics. The method refines characterizations based on other compositional features, such as G+C content and codon usage, and may help in the identification of GIs as well as in the research and development of drugs targeting the virulence genes they contain.
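The two computational ingredients named in the abstract, n-gram frequency distributions and a maximal order Markov model, can be illustrated with a minimal Python sketch. The abstract does not specify the agreement test or the exact form of the statistic, so the choices below are assumptions: total variation distance stands in for the agreement test, and the observed/expected ratio uses the usual maximal order Markov estimate E(w) = N(w1..wn-1) * N(w2..wn) / N(w2..wn-1). All function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count overlapping n-grams in a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Turn raw n-gram counts into a relative frequency distribution."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def total_variation_distance(p: dict, q: dict) -> float:
    """Half the L1 distance between two distributions (0.0 means identical).

    Assumption: a stand-in for the agreement test, which the abstract
    does not specify.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def markov_expected(seq: str, word: str) -> float:
    """Expected count of `word` under a Markov model of order len(word) - 2."""
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 characters")
    word = word.upper()
    shorter = ngram_counts(seq, n - 1)      # counts of (n-1)-grams
    middle = ngram_counts(seq, n - 2)[word[1:-1]]
    if middle == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / middle


def observed_over_expected(seq: str, word: str) -> float:
    """Ratio > 1 hints at overrepresentation, < 1 at underrepresentation."""
    expected = markov_expected(seq, word)
    if expected == 0.0:
        return float("nan")
    return ngram_counts(seq, len(word))[word.upper()] / expected


if __name__ == "__main__":
    island = "ACGTACGTAC"      # toy sequences, not real genome data
    backbone = "AACCGGTTAACC"
    dist = total_variation_distance(
        relative_frequencies(ngram_counts(island, 3)),
        relative_frequencies(ngram_counts(backbone, 3)),
    )
    print("3-gram distance island vs backbone:", dist)
    print("O/E for ACG in island:", observed_over_expected(island, "ACG"))
```

In practice the same functions would be run on each island, on the complete genome, and on the backbone, and a significance threshold would be needed before calling an n-gram over- or underrepresented; this sketch only reports the raw ratio.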