Department of Computer Science, Rice University, Houston, Texas, United States of America.
Department of BioSciences, Rice University, Houston, Texas, United States of America.
PLoS Comput Biol. 2022 Jun 8;18(6):e1010216. doi: 10.1371/journal.pcbi.1010216. eCollection 2022 Jun.
Phylogenomic studies of prokaryotic taxa often assume that conserved marker genes are homologous across their entire length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm that partitions a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs), where each MHG is a maximal set of maximum-length sequences that are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We then examined the correlation between each individual marker and its overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs, while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers cause incorrect inferences as to the overall relationship between bacterial taxa.
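To make the notion of a split being "supported" or "contradicted" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each tree is summarized as a set of bipartitions over the same taxon set, that a marker split counts as supported when an identical bipartition appears among the MHG splits, and that it counts as contradicted when it is incompatible with at least one MHG split (all four pairwise intersections of the two bipartitions are non-empty). The function names `normalize`, `compatible`, and `classify` are hypothetical.

```python
from typing import FrozenSet, Iterable

Taxa = FrozenSet[str]


def normalize(side: Iterable[str], taxa: Taxa) -> Taxa:
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(side) & taxa
    reference = min(taxa)  # arbitrary but deterministic choice of reference
    return taxa - side if reference in side else side


def compatible(a: Taxa, b: Taxa, taxa: Taxa) -> bool:
    """Two splits are compatible if at least one of the four intersections is empty."""
    parts = (a & b, a - b, b - a, taxa - (a | b))
    return any(len(part) == 0 for part in parts)


def classify(marker_split: Iterable[str],
             mhg_splits: Iterable[Iterable[str]],
             taxa: Taxa) -> str:
    """Label a marker split relative to a collection of MHG splits (assumed rule)."""
    m = normalize(marker_split, taxa)
    others = [normalize(s, taxa) for s in mhg_splits]
    if m in others:
        return "supported"
    if any(not compatible(m, o, taxa) for o in others):
        return "contradicted"
    return "unresolved"


if __name__ == "__main__":
    taxa = frozenset("ABCDE")
    print(classify({"A", "B"}, [{"C", "D", "E"}], taxa))
    print(classify({"A", "B"}, [{"A", "C"}], taxa))
    print(classify({"A", "B"}, [{"A", "B", "C"}], taxa))
```

A split that is neither matched nor conflicted is reported as "unresolved" here; how such cases are treated in the study itself is not stated in the abstract.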