Taylor J A, Johnson R S
Immunex Corporation, Seattle, Washington 98101-2936, USA.
Anal Chem. 2001 Jun 1;73(11):2594-604. doi: 10.1021/ac001196o.
There are several computer programs that can match peptide tandem mass spectrometry data to their exactly corresponding database sequences, and in most protein identification projects, these programs are utilized in the early stages of data interpretation. However, situations frequently arise where tandem mass spectral data cannot be correlated with any database sequences. In these cases, the unmatched data could be due to peptides derived from novel proteins, allelic or species-derived variants of known proteins, or posttranslational or chemical modifications. Two additional problems are frequently encountered in high-throughput protein identification. First, it is difficult to quickly sift through large amounts of data to identify those spectra that, due to poor signal or contaminants, can be ignored. Second, it is important to find incorrect database matches (false positives). We have chosen to address these difficulties by performing automatic de novo sequencing using a computer program called Lutefisk. Sequence candidates obtained are used as input in a homology-based database search program called CIDentify to identify variants of known proteins. Comparison of database-derived sequences with de novo sequences allows for electronic validation of database matches even if the latter are not completely correct. Modifications to the original Lutefisk program have been implemented to handle data obtained from triple quadrupole, ion trap, and quadrupole/time-of-flight hybrid (Qtof) mass spectrometers. For example, the linearity of mass errors due to temperature-dependent expansion of the flight tube in a Qtof was exploited such that isobaric amino acids (glutamine/lysine and oxidized methionine/phenylalanine) can be differentiated without careful attention to mass calibration.
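The last point can be illustrated with a small sketch. It is not the Lutefisk implementation; it only shows one way to read the idea, under the assumption that the calibration error is proportional to m/z and that both fragment ions are singly charged. Under that assumption the error largely cancels in the mass difference between two adjacent fragment ions, so the gap can still separate residues that differ by only about 0.03 to 0.04 Da. The function names, the 50 ppm error, and the 0.01 Da tolerance are hypothetical choices made for illustration; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "Q": 128.05858,      # glutamine
    "K": 128.09496,      # lysine
    "F": 147.06841,      # phenylalanine
    "M(ox)": 147.03540,  # oxidized methionine (Met + O)
}


def apply_linear_error(mz, ppm):
    """Simulate a miscalibrated reading whose error grows linearly with m/z.

    Assumption: the error is a constant parts-per-million offset.
    """
    return mz * (1.0 + ppm * 1e-6)


def classify_gap(lower_mz, upper_mz, candidates, tolerance=0.01):
    """Pick the candidate residue closest to the gap between two ions.

    Assumes both ions are singly charged, so the m/z gap equals the
    residue mass. Returns None if no candidate lies within `tolerance`
    (an arbitrary illustrative value, in Da).
    """
    gap = upper_mz - lower_mz
    best = min(candidates, key=lambda r: abs(RESIDUE_MASS[r] - gap))
    if abs(RESIDUE_MASS[best] - gap) <= tolerance:
        return best
    return None


if __name__ == "__main__":
    ppm = 50.0  # hypothetical uncorrected calibration error
    true_lower = 500.0
    true_upper = true_lower + RESIDUE_MASS["K"]

    obs_lower = apply_linear_error(true_lower, ppm)
    obs_upper = apply_linear_error(true_upper, ppm)

    # The absolute error on the upper ion is comparable to the Q/K mass
    # difference, but the error on the gap is several times smaller.
    print("absolute error:", obs_upper - true_upper)
    print("gap error:", (obs_upper - obs_lower) - RESIDUE_MASS["K"])
    print("assigned residue:", classify_gap(obs_lower, obs_upper, ["Q", "K"]))
```

In this toy case the absolute error on the upper ion is roughly 0.03 Da, close to the glutamine/lysine difference itself, while the error on the gap is under 0.01 Da, which is why the assignment remains possible without recalibrating.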