Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland, St Lucia, Brisbane, QLD 4072, Australia.
School of Mathematics and Statistics, The University of Melbourne, Parkville, Melbourne, VIC 3010, Australia.
Sci Rep. 2016 Jul 25;6:29319. doi: 10.1038/srep29319.
Many microbes can acquire genetic material from their environment and incorporate it into their genome, a process known as lateral genetic transfer (LGT). Computational approaches have been developed to detect genomic regions of lateral origin, but typically lack sensitivity, ability to distinguish donor from recipient, and scalability to very large datasets. To address these issues we have introduced an alignment-free method based on ideas from document analysis, term frequency-inverse document frequency (TF-IDF). Here we examine the performance of TF-IDF on three empirical datasets: 27 genomes of Escherichia coli and Shigella, 110 genomes of enteric bacteria, and 143 genomes across 12 bacterial and three archaeal phyla. We investigate the effect of k-mer size, gap size and delineation of groups on the inference of genomic regions of lateral origin, finding an interplay among these parameters and sequence divergence. Because TF-IDF identifies donor groups and delineates regions of lateral origin within recipient genomes, aggregating these regions by gene enables us to explore, for the first time, the mosaic nature of lateral genes including the multiplicity of biological sources, ancestry of transfer and over-writing by subsequent transfers. We carry out Gene Ontology enrichment tests to investigate which biological processes are potentially affected by LGT.
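To make the document-analysis analogy concrete, the following is a minimal sketch of textbook TF-IDF applied to k-mers, treating each genome as a document and each k-mer as a term. It is an illustration under stated assumptions and not the scoring scheme of the published method, which additionally works with groups of genomes and a gap size when delineating regions. The names `kmers`, `tf_idf` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tf_idf(genomes, k):
    """Textbook TF-IDF per genome (document) and k-mer (term).

    Assumption: this is the generic formulation, tf * log(N / df),
    not the group-based variant used for LGT inference.
    """
    counts = {name: Counter(kmers(seq, k)) for name, seq in genomes.items()}
    n_docs = len(genomes)

    # Document frequency: number of genomes containing each k-mer.
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            kmer: (n / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, n in c.items()
        }
    return scores


# Toy sequences, invented for illustration only.
toy = {"A": "ACGTACGT", "B": "ACGTTTTT", "C": "GGGGACGT"}
scores = tf_idf(toy, k=4)
```

In this toy input, a k-mer shared by every genome (such as `ACGT`) scores zero, while a k-mer confined to one genome (such as `TTTT` in `B`) scores highest there. That contrast is the intuition behind using k-mer distributions to flag regions whose composition points to a different source.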