Machado Karla C T, Fortuin Suereta, Tomazella Gisele Guicardi, Fonseca Andre F, Warren Robin Mark, Wiker Harald G, de Souza Sandro Jose, de Souza Gustavo Antonio
Bioinformatics Multidisciplinary Environment, Universidade Federal do Rio Grande do Norte, Natal, Brazil.
DST/NRF Centre of Excellence for Biomedical Tuberculosis Research/SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Stellenbosch, South Africa.
Front Microbiol. 2019 Jun 20;10:1410. doi: 10.3389/fmicb.2019.01410. eCollection 2019.
In proteomics, peptide information in mass spectrometry (MS) data from a sample of a specific organism is routinely matched against the protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or genetically poorly characterized, it becomes challenging to choose a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and interesting genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To address this question, we constructed databases containing conserved and unique sequences for 10 different species. Features that are relevant for probability-based protein identification in proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, [species names missing in source] generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of [species name missing in source] and from two published datasets from [species name missing in source]. We show that, with an approach in which database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
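The idea of controlling database size by removing repeated identical tryptic sequences can be illustrated with a minimal sketch. It is not the authors' pipeline, which the abstract does not describe. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a hypothetical minimum peptide length of 7. The function names and the toy sequences are invented for illustration.

```python
def tryptic_peptides(sequence, min_length=7):
    """In-silico digest: cleave after K or R, except when followed by P.

    Simplified rule with no missed cleavages; min_length is an assumed cutoff.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal fragment that does not end in K or R.
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


def nonredundant_peptide_index(strains, min_length=7):
    """Map each distinct tryptic peptide to the set of strains containing it.

    `strains` is a dict of strain name -> list of protein sequences.
    Identical peptides shared across strains are stored only once.
    """
    index = {}
    for strain, proteins in strains.items():
        for protein in proteins:
            for peptide in tryptic_peptides(protein, min_length):
                index.setdefault(peptide, set()).add(strain)
    return index


# Toy input (hypothetical sequences, not real proteins).
toy_strains = {
    "strain_A": ["MAAAAAAKCCCCCCCRDDDDDDDK"],
    "strain_B": ["MAAAAAAKCCCCCCCREEEEEEEK"],
}

index = nonredundant_peptide_index(toy_strains)
total = sum(
    len(tryptic_peptides(p)) for proteins in toy_strains.values() for p in proteins
)
print(f"{total} peptides before merging, {len(index)} after removing repeats")
for peptide, found_in in sorted(index.items()):
    print(peptide, sorted(found_in))
```

In this toy case the two strains share two of their three peptides, so the merged entry list shrinks from six peptides to four while still recording which strain each peptide came from. The search space then grows with the number of distinct sequences rather than with the number of strains merged.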