Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues.

Author information

Lockwood Svetlana, Brayton Kelly A, Daily Jeff A, Broschat Shira L

Affiliations

School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States.

Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States.

Publication information

Front Microbiol. 2019 Feb 28;10:383. doi: 10.3389/fmicb.2019.00383. eCollection 2019.

DOI:10.3389/fmicb.2019.00383

PMID:30873148

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6403173/

Abstract

We clustered 8.76 million protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes, resulting in 707,311 clusters of one or more sequences, of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge, this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation, we chose four essential proteins representing fundamental cellular functions, the chaperonin GroEL, DNA-dependent RNA polymerase subunits beta and beta' (RpoB/RpoB'), and DNA polymerase I (PolA), and examined their cluster distribution. We found these proteins to be remarkably conserved, with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the GroEL protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB' were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB' proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB' were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
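The check for universal representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the input layout (one `(genome_id, protein_id, cluster_id)` tuple per sequence), the function names, and the toy identifiers are all assumptions made for illustration. The sketch records which genomes contribute at least one sequence to each cluster, then reports clusters that cover every genome and, for a chosen cluster, the genomes that are absent from it.

```python
from collections import defaultdict


def cluster_genome_coverage(assignments):
    """Map each cluster ID to the set of genomes with at least one member sequence.

    `assignments` is an iterable of (genome_id, protein_id, cluster_id) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    coverage = defaultdict(set)
    for genome_id, _protein_id, cluster_id in assignments:
        coverage[cluster_id].add(genome_id)
    return dict(coverage)


def universal_clusters(assignments, all_genomes):
    """Return the sorted IDs of clusters that contain a sequence from every genome."""
    required = set(all_genomes)
    coverage = cluster_genome_coverage(assignments)
    return sorted(c for c, genomes in coverage.items() if genomes >= required)


def missing_genomes(assignments, all_genomes, cluster_id):
    """Return the sorted genomes that have no sequence in the given cluster."""
    coverage = cluster_genome_coverage(assignments)
    return sorted(set(all_genomes) - coverage.get(cluster_id, set()))


# Toy data (entirely made up): three genomes, three clusters.
toy_genomes = ["g1", "g2", "g3"]
toy_assignments = [
    ("g1", "p1", "cluster_A"),
    ("g2", "p2", "cluster_A"),
    ("g3", "p3", "cluster_B"),  # a divergent sequence that landed in its own cluster
    ("g1", "p4", "cluster_C"),
    ("g2", "p5", "cluster_C"),
    ("g3", "p6", "cluster_C"),
]

print(universal_clusters(toy_assignments, toy_genomes))
print(missing_genomes(toy_assignments, toy_genomes, "cluster_A"))
```

In this toy case only `cluster_C` spans all three genomes, and `g3` is absent from `cluster_A` because its sequence formed a separate cluster. A genome can be absent from a cluster for several reasons the abstract mentions (a divergent sequence, a missing or merged gene, or an annotation problem), and a coverage check like this one cannot distinguish among them.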

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc6f/6403173/d65c897e0426/fmicb-10-00383-g001.jpg

Similar articles

Evolution of the RNA polymerase B' subunit gene (rpoB') in Halobacteriales: a complementary molecular marker to the SSU rRNA gene.

Mol Biol Evol. 2004 Dec;21(12):2340-51. doi: 10.1093/molbev/msh248. Epub 2004 Sep 8.

Further refinement of the phylogeny of the Halobacteriaceae based on the full-length RNA polymerase subunit B' (rpoB') gene.

Int J Syst Evol Microbiol. 2010 Oct;60(Pt 10):2398-2408. doi: 10.1099/ijs.0.017160-0. Epub 2009 Nov 27.

Phylogenetic relationships within the family Halobacteriaceae inferred from rpoB' gene and protein sequences.

Int J Syst Evol Microbiol. 2007 Oct;57(Pt 10):2289-2295. doi: 10.1099/ijs.0.65190-0.

Partial sequence comparison of the rpoB, sodA, groEL and gyrB genes within the genus Streptococcus.

Int J Syst Evol Microbiol. 2009 Sep;59(Pt 9):2317-22. doi: 10.1099/ijs.0.005488-0. Epub 2009 Jul 20.

Dissection of phylogenetic relationships among 19 rapidly growing Mycobacterium species by 16S rRNA, hsp65, sodA, recA and rpoB gene sequencing.

Int J Syst Evol Microbiol. 2004 Nov;54(Pt 6):2095-2105. doi: 10.1099/ijs.0.63094-0.

Evaluation of the DNA-dependent RNA polymerase β-subunit gene (rpoB) for phytoplasma classification and phylogeny.

Int J Syst Evol Microbiol. 2013 Oct;63(Pt 10):3904-3914. doi: 10.1099/ijs.0.051912-0.

Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms.

Genes (Basel). 2011 Nov 8;2(4):869-911. doi: 10.3390/genes2040869.

Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

BMC Bioinformatics. 2016 Aug 31;17 Suppl 8(Suppl 8):276. doi: 10.1186/s12859-016-1112-8.

The human "magnesome": detecting magnesium binding sites on human proteins.

BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S10. doi: 10.1186/1471-2105-13-S14-S10. Epub 2012 Sep 7.

Cited by

PASS: Protein Annotation Surveillance Site for Protein Annotation Using Homologous Clusters, NLP, and Sequence Similarity Networks.

Front Bioinform. 2021 Sep 29;1:749008. doi: 10.3389/fbinf.2021.749008. eCollection 2021.

Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.

PLoS One. 2021 Oct 14;16(10):e0258693. doi: 10.1371/journal.pone.0258693. eCollection 2021.

A Need for Improved Cellulase Identification from Metagenomic Sequence Data.

Appl Environ Microbiol. 2020 Dec 17;87(1). doi: 10.1128/AEM.01928-20.

Assessment of a Potential Role of DSM 18020 as a Pectinase Producer for Utilization in Poultry Diets Based on Analyses.

Front Microbiol. 2020 Apr 23;11:751. doi: 10.3389/fmicb.2020.00751. eCollection 2020.
