Spang R, Rehmsmeier M, Stoye J
German Cancer Research Center (DKFZ), Theoretical Bioinformatics, Heidelberg, Germany.
Proc Int Conf Intell Syst Mol Biol. 2000;8:367-75.
We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The rationale is to exploit both the vertical and the horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods such as profiles and hidden Markov models, which focus on vertical information because they model the columns of the alignment independently. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. To do so, each candidate sequence is tested separately against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, at each position the candidate sequence is aligned to one sequence of the multiple alignment, called the "reference sequence". The reference sequence may change within the alignment, and each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compared it to hidden Markov models on a subset of the SCOP database of protein domains. The discriminative quality was assessed by counting the number of false positives that ranked higher than the first true positive (FP-count). For moderate FP-counts above five, the number of successful searches with our method was considerably higher than with hidden Markov models.
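The jumping alignment idea can be illustrated with a minimal sketch, which should not be read as the authors' implementation. The abstract does not specify the scoring scheme, so everything numeric here is an assumption: a flat `match`/`mismatch` score stands in for a substitution matrix, `gap` is a linear gap penalty, `jump` is the cost of switching the reference sequence, and gap characters (`-`) in a reference row can be skipped for free. The function names `jumping_alignment_score` and `fp_count` are hypothetical.

```python
def jumping_alignment_score(candidate, alignment,
                            match=2, mismatch=-1, gap=2, jump=3):
    """Best local score of `candidate` against a multiple alignment.

    `alignment` is a list of equal-length rows that use '-' for gaps.
    All scoring parameters are illustrative assumptions.
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equally long")
    n, m, rows = len(candidate), len(alignment[0]), len(alignment)
    # prev[j][k]: best score ending at the previous candidate position,
    # alignment column j, with row k as the current reference sequence.
    prev = [[0] * rows for _ in range(m + 1)]
    best = 0
    for i in range(1, n + 1):
        curr = [[0] * rows for _ in range(m + 1)]
        for j in range(1, m + 1):
            raw = []
            for k in range(rows):
                ref = alignment[k][j - 1]
                if ref == "-":
                    # Residue placed against a gap in the reference row.
                    diag = prev[j - 1][k] - gap
                    # Skipping a gap column of the reference is free.
                    left = curr[j - 1][k]
                else:
                    hit = match if candidate[i - 1] == ref else mismatch
                    diag = prev[j - 1][k] + hit
                    left = curr[j - 1][k] - gap
                up = prev[j][k] - gap
                # The 0 makes the alignment local, as in Smith-Waterman.
                raw.append(max(0, diag, up, left))
            # Jump: continue from the best row in this cell, minus a penalty.
            after_jump = max(raw) - jump
            curr[j] = [max(score, after_jump) for score in raw]
            best = max(best, max(curr[j]))
        prev = curr
    return best


def fp_count(scored):
    """Count false positives ranked above the first true positive.

    `scored` is a list of (score, is_member) pairs. Ties are resolved
    pessimistically (false positives first). If there is no true
    positive, every false positive is counted.
    """
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    count = 0
    for _, is_member in ranked:
        if is_member:
            break
        count += 1
    return count
```

For example, with the toy alignment `["AAAAWWWW", "CCCCDDDD"]` and the candidate `"AAAADDDD"`, no single row explains the whole candidate, but following the first row for four columns and then jumping to the second row does. With the default parameters this path scores 8 + 8 - 3 = 13, whereas a prohibitive jump penalty leaves only the single-row score of 8.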