

Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues.

Authors

Lockwood Svetlana, Brayton Kelly A, Daily Jeff A, Broschat Shira L

Affiliations

School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States.

Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States.

Publication

Front Microbiol. 2019 Feb 28;10:383. doi: 10.3389/fmicb.2019.00383. eCollection 2019.

Abstract

We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes, resulting in 707,311 clusters of one or more sequences, of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge, this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation, we chose four essential proteins representing fundamental cellular functions, the chaperonin GroEL, the DNA-dependent RNA polymerase subunits beta and beta' (RpoB/RpoB'), and DNA polymerase I (PolA), and examined their cluster distribution. We found these proteins to be remarkably conserved, with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the GroEL protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB' were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB' proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB' were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
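The check for universal representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (tuples of genome ID, protein ID, cluster ID), the function names, and the toy identifiers are all assumptions made for illustration. The sketch reports which clusters contain at least one sequence from every genome, and which genomes are absent from a given cluster.

```python
from collections import defaultdict

# Hypothetical toy data: (genome_id, protein_id, cluster_id).
# Real cluster assignments would come from the clustering output.
TOY_ASSIGNMENTS = [
    ("genomeA", "A_groEL", "c1"),
    ("genomeB", "B_groEL", "c1"),
    ("genomeA", "A_rpoB", "c2"),
    ("genomeB", "B_rpoB", "c2"),
    ("genomeC", "C_rpoB", "c2"),
    ("genomeC", "C_polA", "c3"),
]


def cluster_coverage(assignments):
    """Map each cluster ID to the set of genomes contributing a sequence."""
    genomes_by_cluster = defaultdict(set)
    for genome_id, _protein_id, cluster_id in assignments:
        genomes_by_cluster[cluster_id].add(genome_id)
    return dict(genomes_by_cluster)


def universal_clusters(assignments):
    """Return the sorted IDs of clusters represented in every genome."""
    all_genomes = {genome_id for genome_id, _, _ in assignments}
    coverage = cluster_coverage(assignments)
    return sorted(c for c, genomes in coverage.items() if genomes == all_genomes)


def missing_genomes(assignments, cluster_id):
    """Return the sorted genomes that have no sequence in the given cluster."""
    all_genomes = {genome_id for genome_id, _, _ in assignments}
    present = cluster_coverage(assignments).get(cluster_id, set())
    return sorted(all_genomes - present)


if __name__ == "__main__":
    print("Universal clusters:", universal_clusters(TOY_ASSIGNMENTS))
    print("Genomes missing from c1:", missing_genomes(TOY_ASSIGNMENTS, "c1"))
```

In this toy input, genomeC has no sequence in cluster `c1`. A gap like that could reflect a truly absent gene, a divergent sequence placed in a different cluster, or an annotation problem, which are the cases the abstract distinguishes.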


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc6f/6403173/d65c897e0426/fmicb-10-00383-g001.jpg
