Vernikos Georgios S, Parkhill Julian
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
Genome Res. 2008 Feb;18(2):331-42. doi: 10.1101/gr.7004508. Epub 2007 Dec 10.
Large inserts of horizontally acquired DNA that contain functionally related genes with limited phylogenetic distribution are often referred to as genomic islands (GIs), and structural definitions of these islands, based on common features, have been proposed. Although a large number of mobile elements fall well within the GI definition, there are several concerns about the structural consensus for GIs: the current GI definition was put forward 10 years ago, when only 12 complete bacterial genomes were available; a large number of GIs deviate from that definition; and in silico predictions that assume a full or partial GI structural model bias the sampling of the GI structural space toward "well-structured" GIs. In this study, the structural features of genomic regions are sampled by a hypothesis-free, bottom-up search, and these features are used in a machine learning approach with the aim of explicitly quantifying and modeling the contribution of each feature to the GI structure. In a whole-genome comparative analysis of 37 strains from three different genera and 12 outgroup genomes, 668 genomic regions were sampled and used to train structural GI models. The data show that, overall, GIs from the three genera fall into distinct, genus-specific structural families. However, decreasing the taxonomic resolution by studying GI structures across genus boundaries yields models that converge on a fairly similar GI structure, which further suggests that GIs can be seen as a superfamily of mobile elements, with core and variable structural features, rather than as a well-defined family.