Taylor J A, Johnson R S
Department of Biochemistry, University of Washington, Seattle 98195-7350, USA.
Rapid Commun Mass Spectrom. 1997;11(9):1067-75. doi: 10.1002/(SICI)1097-0231(19970615)11:9<1067::AID-RCM953>3.0.CO;2-L.
A method is described for searching protein sequence databases using tandem mass spectra of tryptic peptides. The approach uses a de novo sequencing algorithm to derive a short list of candidate sequences, which then serve as query sequences in a homology-based database search routine. The sequencing algorithm employs a graph theory approach similar to that of previously described sequencing programs. In addition, amino acid composition, peptide sequence tags, and incomplete or ambiguous Edman sequence data can be used to aid sequence determination. Although peptides can be sequenced from tandem mass spectra, a frequently encountered difficulty is that several alternative sequences can be deduced from one spectrum. Most of the alternative sequences, however, are sufficiently similar for a homology-based sequence database search to be possible. Unfortunately, the available protein sequence database search algorithms (e.g. BLAST or FASTA) require a single unambiguous sequence as input. Here we describe how the publicly available FASTA computer program was modified to search protein databases more effectively in spite of the ambiguities intrinsic to de novo peptide sequencing algorithms.
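The idea of searching with several similar candidate sequences instead of one unambiguous query can be illustrated with a minimal Python sketch. It is not the modified FASTA program described above: the function names, the ungapped identity score, and the choice to merge I/L and K/Q are assumptions made for illustration only. I and L have identical masses, and K and Q differ by about 0.036 Da, so they are treated here as indistinguishable.

```python
# Hypothetical sketch: NOT the modified FASTA program from the abstract.
# All names and the scoring scheme are assumptions for illustration.

# Map L -> I and Q -> K, residue pairs assumed to be indistinguishable
# in the tandem mass spectrum (I/L identical mass, K/Q ~0.036 Da apart).
ISOBARIC = str.maketrans({"L": "I", "Q": "K"})


def normalize(seq: str) -> str:
    """Collapse residues the spectrum is assumed unable to tell apart."""
    return seq.upper().translate(ISOBARIC)


def best_ungapped_identity(query: str, target: str) -> int:
    """Slide the query along the target without gaps and return the
    largest number of identical (normalized) residues at any offset."""
    q, t = normalize(query), normalize(target)
    best = 0
    for offset in range(-(len(q) - 1), len(t)):
        matches = 0
        for i, residue in enumerate(q):
            j = offset + i
            if 0 <= j < len(t) and t[j] == residue:
                matches += 1
        best = max(best, matches)
    return best


def search(candidates, database):
    """Score each database entry by its best match over ALL candidate
    queries, then rank entries from highest to lowest score."""
    hits = []
    for name, protein in database.items():
        score = max(best_ungapped_identity(c, protein) for c in candidates)
        hits.append((score, name))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits


if __name__ == "__main__":
    # Two made-up de novo candidates for one spectrum and a toy database.
    candidates = ["LSDQEAR", "ISDKAER"]
    database = {"protA": "MKTISDKEARGG", "protB": "MAAAPPPGGG"}
    for score, name in search(candidates, database):
        print(name, score)
```

Taking the maximum over all candidates means a database entry is found if any one of the alternative sequences resembles it, which is the property the abstract attributes to searching with a candidate list; a real implementation would use a substitution matrix and statistical significance estimates rather than a raw identity count.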