Lockwood Svetlana, Brayton Kelly A, Daily Jeff A, Broschat Shira L
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States.
Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States.
Front Microbiol. 2019 Feb 28;10:383. doi: 10.3389/fmicb.2019.00383. eCollection 2019.
We clustered 8.76 million protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes, resulting in 707,311 clusters of one or more sequences, of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge, this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation, we chose four essential proteins representing fundamental cellular functions, namely the chaperonin GroEL, DNA-dependent RNA polymerase subunits beta and beta' (RpoB/RpoB'), and DNA polymerase I (PolA), and examined their cluster distribution. We found these proteins to be remarkably conserved, with certain caveats. Although the groEL gene was conserved in all the organisms in the study, the GroEL protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB' were missing from two genomes and merged in 88, and 18 RpoB proteins and 14 RpoB' proteins were sufficiently divergent that they formed separate clusters (seven and three clusters, respectively). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB' were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
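The check for universal representation described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes the clustering output is already available as (cluster ID, genome ID) pairs, one per protein sequence, and the function names and the toy identifiers (`c1`, `gA`, and so on) are hypothetical.

```python
from collections import defaultdict


def cluster_coverage(assignments):
    """Map each cluster ID to the set of genomes contributing at least one sequence.

    `assignments` is an iterable of (cluster_id, genome_id) pairs, one pair per
    protein sequence. This input layout is an assumption made for illustration.
    """
    coverage = defaultdict(set)
    for cluster_id, genome_id in assignments:
        coverage[cluster_id].add(genome_id)
    return dict(coverage)


def universal_clusters(assignments, all_genomes):
    """Return the sorted IDs of clusters that contain a sequence from every genome."""
    required = set(all_genomes)
    coverage = cluster_coverage(assignments)
    return sorted(cid for cid, genomes in coverage.items() if genomes >= required)


def missing_genomes(assignments, all_genomes, cluster_id):
    """Return the sorted genomes that have no sequence in the given cluster."""
    coverage = cluster_coverage(assignments)
    return sorted(set(all_genomes) - coverage.get(cluster_id, set()))


# Toy data (hypothetical): three genomes, three clusters.
toy_assignments = [
    ("c1", "gA"), ("c1", "gB"), ("c1", "gC"),  # present in every genome
    ("c2", "gA"), ("c2", "gA"), ("c2", "gB"),  # two copies in gA, absent from gC
    ("c3", "gC"),                              # singleton cluster
]
toy_genomes = ["gA", "gB", "gC"]

print(universal_clusters(toy_assignments, toy_genomes))
print(missing_genomes(toy_assignments, toy_genomes, "c2"))
```

In the study's terms, an empty result from `universal_clusters` corresponds to the finding that no single cluster had a representative from all organisms, and `missing_genomes` lists the genomes one would then inspect for a missing, merged, divergent, or misannotated gene.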