On the Impact of the Pangenome and Annotation Discrepancies While Building Protein Sequence Databases for Bacteria Proteogenomics.

Authors

Machado Karla C T, Fortuin Suereta, Tomazella Gisele Guicardi, Fonseca Andre F, Warren Robin Mark, Wiker Harald G, de Souza Sandro Jose, de Souza Gustavo Antonio

Affiliations

Bioinformatics Multidisciplinary Environment, Universidade Federal do Rio Grande do Norte, Natal, Brazil.

DST/NRF Centre of Excellence for Biomedical Tuberculosis Research/SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Stellenbosch, South Africa.

Publication information

Front Microbiol. 2019 Jun 20;10:1410. doi: 10.3389/fmicb.2019.01410. eCollection 2019.

DOI:10.3389/fmicb.2019.01410

PMID:31281302

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6596428/

Abstract

In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against a protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or genetically poorly characterized, it becomes challenging to choose a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for 10 different species. Features that are relevant for probability-based protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, [species names missing in source] generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of [species name missing in source] and two published datasets from [species name missing in source]. We show that with an approach in which database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
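The size-control idea in the last sentence of the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `tryptic_peptides` and `build_nonredundant_peptides`, the cleavage rule (cut after K or R unless the next residue is P), the minimum peptide length, and the toy sequences are all assumptions chosen for illustration. The sketch digests each protein in silico and keeps every distinct tryptic peptide once, recording which strains contain it.

```python
import re


def tryptic_peptides(sequence, min_length=7):
    """In silico digestion: cleave after K or R unless followed by P.

    The cleavage rule and min_length are illustrative assumptions,
    not parameters taken from the paper.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Drop short fragments (and the empty piece left after a C-terminal K/R).
    return [p for p in pieces if len(p) >= min_length]


def build_nonredundant_peptides(strain_proteomes, min_length=7):
    """Map each distinct tryptic peptide to the set of strains containing it.

    strain_proteomes: dict of strain name -> list of protein sequences
    (a hypothetical input layout for this sketch).
    """
    peptide_to_strains = {}
    for strain, proteins in strain_proteomes.items():
        for sequence in proteins:
            for peptide in tryptic_peptides(sequence, min_length):
                peptide_to_strains.setdefault(peptide, set()).add(strain)
    return peptide_to_strains


# Toy input: two made-up strains that differ by one residue in one protein.
toy_proteomes = {
    "strain_A": ["MKTAYIAKQRQISFVK", "MSLLTEVETPIR"],
    "strain_B": ["MKTAYIAKQRQISFVK", "MSLLTEVETPTR"],
}

nonredundant = build_nonredundant_peptides(toy_proteomes, min_length=6)

# Redundant count: every peptide from every strain, duplicates included.
redundant_count = sum(
    len(tryptic_peptides(seq, 6))
    for proteins in toy_proteomes.values()
    for seq in proteins
)
```

In this toy case, peptides shared by both strains are stored once, and only the strain-specific variant peptide adds a new entry, so the search space grows with sequence novelty and not with the number of strains merged.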


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d81a/6596428/119dcb4e7545/fmicb-10-01410-g001.jpg

Similar articles

Identification of new protein coding sequences and signal peptidase cleavage sites of Helicobacter pylori strain 26695 by proteogenomics.

J Proteomics. 2013 Jun 28;86:27-42. doi: 10.1016/j.jprot.2013.04.036. Epub 2013 May 9.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

OryzaPG-DB: rice proteome database based on shotgun proteogenomics.

BMC Plant Biol. 2011 Apr 12;11:63. doi: 10.1186/1471-2229-11-63.

Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys.

BMC Genomics. 2017 Nov 13;18(1):877. doi: 10.1186/s12864-017-4279-0.

Identification of Yersinia pestis and Escherichia coli strains by whole cell and outer membrane protein extracts with mass spectrometry-based proteomics.

J Proteome Res. 2010 Jul 2;9(7):3647-55. doi: 10.1021/pr100402y.

Proteogenomics: From next-generation sequencing (NGS) and mass spectrometry-based proteomics to precision medicine.

Clin Chim Acta. 2019 Nov;498:38-46. doi: 10.1016/j.cca.2019.08.010. Epub 2019 Aug 14.

Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database.

Mol Cell Proteomics. 2011 Jan;10(1):M110.002527. doi: 10.1074/mcp.M110.002527. Epub 2010 Oct 28.

Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification

Proteogenomics-Guided Evaluation of RNA-Seq Assembly and Protein Database Construction for Emergent Model Organisms.

Proteomics. 2020 May;20(10):e1900261. doi: 10.1002/pmic.201900261. Epub 2020 May 18.

Cited by

A Mineral-Doped Micromodel Platform Demonstrates Fungal Bridging of Carbon Hot Spots and Hyphal Transport of Mineral-Derived Nutrients.

mSystems. 2022 Dec 20;7(6):e0091322. doi: 10.1128/msystems.00913-22. Epub 2022 Nov 17.

Identification and Characterization of Marine Microorganisms by Tandem Mass Spectrometry Proteotyping.

Microorganisms. 2022 Mar 26;10(4):719. doi: 10.3390/microorganisms10040719.

Building pan-genome infrastructures for crop plants and their use in association genetics.

DNA Res. 2021 Jan 19;28(1). doi: 10.1093/dnares/dsaa030.
