Narechania Apurva, Bobo Dean, DeSalle Rob, Mathema Barun, Kreiswirth Barry, Planet Paul J
Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA.
Section for Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark.
Mol Biol Evol. 2025 Mar 5;42(3). doi: 10.1093/molbev/msaf029.
Most microbes have the capacity to acquire genetic material from their environment. Recombination of foreign DNA yields genomes that are, at least in part, incongruent with the vertical history of their species. Dominant approaches for detecting these transfers are phylogenetic, requiring a painstaking series of analyses including alignment and tree reconstruction. But these methods do not scale. Here, we propose an unsupervised, alignment-free, and tree-free technique based on the sequential information bottleneck, an optimization procedure designed to extract some portion of relevant information from one random variable conditioned on another. In our case, this joint probability distribution tabulates occurrence counts of k-mers against their genomes of origin with the expectation that recombination will create a strong signal that unifies certain sets of co-occurring k-mers. We conceptualize the technique as a rate-distortion problem, measuring distortion in the relevance information as k-mers are compressed into clusters based on their co-occurrence in the source genomes. The result is fast, model-free, lossy compression of k-mers into learned groups of shared genome sequence, differentiating recombined elements from the vertically inherited core. We show that the technique yields a new recombination measure based purely on information, divorced from any biases and limitations inherent to alignment and phylogeny.
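The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sequences, k = 4, the choice of two clusters, and all function names are assumptions made for illustration. The sketch tabulates k-mer counts against genomes as a joint distribution p(x, y), then applies a hard-clustering, draw-and-merge update of the kind commonly used for the sequential information bottleneck, where the cost of placing k-mer x in cluster t is (p(x) + p(t)) times a weighted Jensen-Shannon divergence between their genome profiles. The retained relevance information I(T;Y) is then compared with I(X;Y).

```python
import math
import random
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def joint_distribution(genomes, k):
    """Return (kmers, p) with p[i][j] = p(kmer_i, genome_j), normalized counts."""
    counts = [kmer_counts(g, k) for g in genomes]
    kmers = sorted(set().union(*counts))
    total = sum(sum(c.values()) for c in counts)
    return kmers, [[c[km] / total for c in counts] for km in kmers]

def mutual_information(rows):
    """Mutual information (bits) between row index and column index of a joint table."""
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    mi = 0.0
    for r, pr in zip(rows, row_tot):
        for v, pc in zip(r, col_tot):
            if v > 0:
                mi += v * math.log2(v / (pr * pc))
    return mi

def merge_cost(px, pt):
    """Relevance information lost by merging joint rows px and pt:
    (p(x) + p(t)) * weighted Jensen-Shannon divergence of p(y|x) and p(y|t)."""
    sx, st = sum(px), sum(pt)
    if st < 1e-12:  # empty cluster: nothing to merge with, no loss
        return 0.0
    cost = 0.0
    for a, b in zip(px, pt):
        m = (a + b) / (sx + st)  # p(y | merged cluster)
        if a > 0:
            cost += a * math.log2((a / sx) / m)
        if b > 0:
            cost += b * math.log2((b / st) / m)
    return cost

def sequential_ib(p, n_clusters, seed=0, max_sweeps=50):
    """Hard clustering by repeated draw-and-merge sweeps over the rows of p."""
    rng = random.Random(seed)
    n, m = len(p), len(p[0])
    assign = [i % n_clusters for i in range(n)]
    rng.shuffle(assign)  # random initial partition
    clusters = [[0.0] * m for _ in range(n_clusters)]
    for i, t in enumerate(assign):
        for j in range(m):
            clusters[t][j] += p[i][j]
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            cur = assign[i]
            for j in range(m):  # draw row i out of its current cluster
                clusters[cur][j] = max(0.0, clusters[cur][j] - p[i][j])
            costs = [merge_cost(p[i], c) for c in clusters]
            best = min(range(n_clusters), key=costs.__getitem__)
            if costs[cur] <= costs[best] + 1e-12:
                best = cur  # stay put unless the move is a strict improvement
            for j in range(m):  # merge row i into the chosen cluster
                clusters[best][j] += p[i][j]
            if best != cur:
                assign[i] = best
                moved = True
        if not moved:
            break
    return assign, clusters

# Toy data (assumed): a shared "core" plus an "island" present in two genomes.
core = "ACGTTGCAAGCTAGGCTA"
island = "TTTTCCCCGGGG"
genomes = [core, core + island, core, core + island]

kmers, p = joint_distribution(genomes, k=4)
assign, clusters = sequential_ib(p, n_clusters=2)
i_xy = mutual_information(p)         # information before compression
i_ty = mutual_information(clusters)  # information retained by the clusters
print(f"I(X;Y) = {i_xy:.4f} bits, I(T;Y) = {i_ty:.4f} bits")
```

The gap between I(X;Y) and I(T;Y) is the distortion incurred by compressing k-mers into clusters. Because the update is greedy, the result depends on the random initial partition, so in practice several restarts with different seeds would be compared.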