Department of Systems Biology, Center for Biological Sequence Analysis, The Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
Environ Microbiol. 2013 Dec;15(12):3121-9. doi: 10.1111/1462-2920.12236. Epub 2013 Aug 29.
Everyone working with bacterial genomics is familiar with the phrase 'too much data'. In this Genome Update, we discuss two methods for helping to deal with this explosion of genomic information. First, we introduce the concept of calculating a quality score for each sequenced genome, and second, we describe a method to quickly sort through genomes for a particular set of protein families. We apply these two methods to all of the current Escherichia coli genomes available in the National Center for Biotechnology Information database. Of the 2074 E. coli/Shigella genomes listed (June 2013), fewer than half (983) are of sufficient quality to use in comparative genomic work. Unfortunately, even some of the 'complete' E. coli genomes are in pieces, and a few 'draft' genomes are of good quality. Six of the seven known sigma factors in E. coli strain K-12 are extremely well conserved; the iron-regulating sigma factor FecI (σ¹⁹) is missing in most genomes. Surprisingly, the E. coli strain CFT073 genome does not encode a functional RpoD (σ⁷⁰), which is obviously essential, and this is likely due to poor genome assembly/annotation. We find a possible novel sigma factor present in more than a hundred E. coli genomes.