Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA.
BMC Genomics. 2011 Jul 11;12:358. doi: 10.1186/1471-2164-12-358.
Database searching is the most frequently used approach for automated peptide assignment and protein inference from tandem mass spectra. The results, however, depend on the sequences in the target database and on the search algorithm. Recently, by using an alternative splicing database, we identified more proteins in Aspergillus flavus than with the annotated proteins alone. In this study, we aimed to find a greater number of eligible splice variants based on newly available transcript sequences and the latest genome annotation. The improved database was then used to compare four search algorithms: Mascot, OMSSA, X! Tandem, and InsPecT.
The updated alternative splicing database predicted 15833 putative protein variants, 61% more than the previous results. There was transcript evidence for 50% of the updated genes, compared to the previous 35% coverage. Database searches were conducted using the same set of spectral data, search parameters, and protein database but with different algorithms. The false discovery rates of the peptide-spectrum matches were estimated to be below 2%. The total number of identified proteins varied from 765 to 867 between algorithms. Whereas 42% (1651/3891) of peptide assignments were unanimous, the comparison showed that 51% (568/1114) of the RefSeq proteins and 15% (11/72) of the putative splice variants were inferred by all algorithms. Twelve plausible isoforms were discovered by focusing on the consensus peptides, which were detected by at least three different algorithms. The analysis found different conserved domains in two putative isoforms of UDP-galactose 4-epimerase.
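The consensus step described above (keeping peptides detected by at least three of the four algorithms, and measuring how many assignments are unanimous) can be illustrated with a minimal Python sketch. The function names, the input layout (engine name mapped to a set of peptide sequences), and the toy peptide strings are all assumptions made for illustration; they are not the authors' pipeline or data.

```python
from collections import Counter


def consensus_peptides(assignments, min_engines=3):
    """Return peptides reported by at least `min_engines` search engines.

    `assignments` (hypothetical layout) maps an engine name to the
    collection of peptide sequences that engine reported.
    """
    counts = Counter()
    for peptides in assignments.values():
        counts.update(set(peptides))  # count each peptide once per engine
    return {pep for pep, n in counts.items() if n >= min_engines}


def unanimous_fraction(assignments):
    """Fraction of all reported peptides that every engine agrees on."""
    all_peptides = set().union(*assignments.values())
    if not all_peptides:
        return 0.0
    unanimous = consensus_peptides(assignments, min_engines=len(assignments))
    return len(unanimous) / len(all_peptides)


# Made-up peptide sets, for illustration only (not data from the study).
toy = {
    "Mascot": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
    "OMSSA": {"PEPTIDEA", "PEPTIDEB"},
    "X! Tandem": {"PEPTIDEA", "PEPTIDEB", "PEPTIDED"},
    "InsPecT": {"PEPTIDEA", "PEPTIDED"},
}

print(sorted(consensus_peptides(toy, min_engines=3)))
print(unanimous_fraction(toy))
```

Applied to the reported counts, the same ratio gives the unanimous share of peptide assignments: 1651/3891 is roughly 0.42. Raising `min_engines` trades sensitivity for fewer engine-specific false positives, which is the rationale the abstract gives for the consensus filter.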
We were able to detect dozens of new peptides using the improved alternative splicing database together with the recently updated annotation of the A. flavus genome. Unlike the identifications of the peptides and the RefSeq proteins, the putative splice variants identified by different algorithms varied widely. Twelve candidate putative isoforms were reported based on the consensus peptide-spectrum matches. This suggests that applying multiple search engines effectively reduced possible false positive results and validated the protein identifications from tandem mass spectra obtained with an alternative splicing database.