Roos Franz F, Jacob Riko, Grossmann Jonas, Fischer Bernd, Buhmann Joachim M, Gruissem Wilhelm, Baginsky Sacha, Widmayer Peter
Institute of Theoretical Computer Science, Institute of Plant Science, Institute of Computational Science, ETH Zurich, CH-8092 Zurich, Switzerland.
Bioinformatics. 2007 Nov 15;23(22):3016-23. doi: 10.1093/bioinformatics/btm417. Epub 2007 Sep 3.
Tandem mass spectrometry enables high-throughput identification of proteins in complex samples. Searching tandem mass spectra against sequence databases is currently the main analysis method. Since many peptide variations are possible, including them in the search space is a natural step. However, the search space usually grows exponentially with the number of independent variations and may therefore overwhelm computational resources.
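The exponential growth can be made concrete with a minimal sketch. It assumes the simplest case, in which each variable site on a peptide is either unmodified or modified, so k independent sites yield 2^k candidate peptides. The function name `enumerate_variants` and the `*` marker for a modified residue are illustrative choices, not notation from the paper.

```python
from itertools import product
from typing import List


def enumerate_variants(peptide: str, sites: List[int]) -> List[str]:
    """Return every variant of `peptide` in which each listed site is
    independently unmodified or modified (marked with a trailing '*').

    Assumption: exactly two states per site; real searches may allow
    several alternative modifications per residue.
    """
    variants = []
    # One boolean per site: False = unmodified, True = modified.
    for choice in product((False, True), repeat=len(sites)):
        residues = list(peptide)
        for site, modified in zip(sites, choice):
            if modified:
                residues[site] += "*"
        variants.append("".join(residues))
    return variants


# Three independent sites give 2**3 = 8 candidates for one peptide.
variants = enumerate_variants("MSTK", [0, 1, 2])
print(len(variants))
```

Each additional independent site doubles the count, so the number of candidates per peptide quickly outgrows the number of database peptides themselves.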
We provide fast, cache-efficient search algorithms to screen large peptide search spaces including non-tryptic peptides, whole genomes, dozens of posttranslational modifications, unannotated point mutations and even unannotated splice sites. All these search spaces can be screened simultaneously. By optimizing the cache usage, we achieve a calculation speed that closely approaches the limits of the hardware. At the same time, we control the size of the overall search space by limiting the combinations of variations that can co-occur on the same peptide. Using a hypergeometric scoring scheme, we applied these algorithms to a dataset of 1 420 632 spectra. We were able to identify a considerable number of peptide variations within a modest amount of computing time on standard desktop computers.
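Two ideas from this paragraph can be sketched under stated assumptions. First, capping the number of variations that may co-occur on one peptide turns the 2^k count into a sum of binomial coefficients. Second, a hypergeometric score can be read as the tail probability of observing at least a given number of matches by chance. The abstract does not say how the hypergeometric model is parameterised, so the binning of the mass range and the names `bounded_variant_count`, `hypergeometric_pvalue`, `total_bins`, `peak_bins` and `theoretical_bins` are hypothetical illustrations, not the authors' implementation.

```python
from math import comb


def bounded_variant_count(num_sites: int, max_variations: int) -> int:
    """Number of variants when at most `max_variations` of `num_sites`
    two-state sites may be modified on the same peptide."""
    limit = min(num_sites, max_variations)
    return sum(comb(num_sites, k) for k in range(limit + 1))


def hypergeometric_pvalue(total_bins: int, peak_bins: int,
                          theoretical_bins: int, matches: int) -> float:
    """P(X >= matches) for X ~ Hypergeometric(N, K, n).

    Assumed model (not taken from the paper): the mass range is split
    into `total_bins` bins, `peak_bins` of them hold measured peaks, and
    a candidate peptide predicts fragments in `theoretical_bins` bins.
    """
    upper = min(peak_bins, theoretical_bins)
    denominator = comb(total_bins, theoretical_bins)
    tail = sum(
        comb(peak_bins, k) * comb(total_bins - peak_bins, theoretical_bins - k)
        for k in range(matches, upper + 1)
    )
    return tail / denominator


# With 10 sites, allowing at most 2 variations gives 1 + 10 + 45 candidates
# instead of 2**10.
print(bounded_variant_count(10, 2), 2 ** 10)

# Chance of matching both predicted bins when 3 of 10 bins hold peaks.
print(hypergeometric_pvalue(10, 3, 2, 2))
```

A smaller tail probability indicates a match that is less likely to arise by chance, so a score such as the negative logarithm of this value would rank candidates; whether the paper uses exactly this transformation is not stated in the abstract.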