Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, 72205, USA.
The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee, 37996, USA.
Commun Biol. 2021 Jan 26;4(1):117. doi: 10.1038/s42003-020-01626-5.
In this study, more than one hundred thousand Escherichia coli and Shigella genomes were examined and classified. This is, to our knowledge, the largest E. coli genome dataset analyzed to date. A Mash-based analysis of a cleaned set of 10,667 E. coli genomes from GenBank revealed 14 distinct phylogroups. A representative genome, or medoid, was identified for each phylogroup and used as a proxy to classify 95,525 unassembled genomes from the Sequence Read Archive (SRA). We find that most of the sequenced E. coli genomes belong to four phylogroups (A, C, B1 and E2(O157)). The authenticity of the 14 phylogroups is supported by several lines of evidence: phylogroup-specific core genes, a phylogenetic tree constructed with 2613 single-copy core genes, and differences in the rates of gene gain, loss and duplication. The methodology used in this work reproduces known phylogroups and also identifies previously uncharacterized phylogroups within E. coli.
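The medoid-as-proxy step can be illustrated with a minimal sketch. It assumes the usual definition of a medoid (the group member with the smallest summed distance to the other members) and nearest-medoid assignment for new genomes; the abstract does not spell out either detail, so this is not the authors' implementation. The distance matrix, group memberships, and function names below are hypothetical toy values, standing in for pairwise Mash distances.

```python
from typing import Dict, List, Sequence


def find_medoid(dist: Sequence[Sequence[float]], members: Sequence[int]) -> int:
    """Return the member index with the smallest summed distance to its group."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def assign_to_nearest_medoid(query_to_medoid: Dict[str, float]) -> str:
    """Return the phylogroup label whose medoid is closest to the query genome."""
    return min(query_to_medoid, key=query_to_medoid.get)


# Toy data (assumption): six genomes placed on a line, so that the pairwise
# distance matrix is symmetric and self-consistent. Real input would be
# pairwise Mash distances between assembled genomes.
positions: List[float] = [0.00, 0.01, 0.03, 0.10, 0.12, 0.13]
dist = [[abs(a - b) for b in positions] for a in positions]

# Hypothetical phylogroup memberships, given as genome indices.
groups: Dict[str, List[int]] = {"A": [0, 1, 2], "B1": [3, 4, 5]}

# One representative genome per phylogroup.
medoids = {name: find_medoid(dist, members) for name, members in groups.items()}

# Classify a new genome from its distances to each medoid (hypothetical values).
label = assign_to_nearest_medoid({"A": 0.04, "B1": 0.002})

print(medoids)
print(label)
```

Comparing each unassembled genome against one medoid per phylogroup, rather than against every reference genome, keeps the number of comparisons per query equal to the number of phylogroups (14 in the study), which is what makes this kind of proxy attractive at SRA scale.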