Wu Kuen-Pin, Lin Hsin-Nan, Sung Ting-Yi, Hsu Wen-Lian
Institute of Information Sciences, Academia Sinica, Taipei, 115, Taiwan.
Proc IEEE Comput Soc Bioinform Conf. 2003;2:347-52.
Protein sequence analysis is an important tool for decoding the logic of life. One of the most important similarity measures in this area is the edit distance between the amino acids of two sequences. We believe this criterion should be reconsidered, because protein features are probably associated more with small peptide fragments than with individual amino acids. In this paper, we design small patterns that are associated with highly conserved regions among a set of protein sequences. These patterns are used analogously to index terms in information retrieval; therefore, we do not consider gaps within patterns. This new similarity measure has been applied to phylogenetic tree construction, protein clustering, and protein secondary structure prediction, and has produced promising results.
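The idea of treating gap-free patterns as index terms can be illustrated with a minimal sketch. The abstract does not say how the patterns are designed or how they are scored, so everything below is an assumption made for illustration: fixed-length contiguous fragments (k-mers) stand in for the paper's patterns, and cosine similarity over pattern counts stands in for its similarity measure. The names `gapless_patterns` and `pattern_similarity` and the default `k = 3` are hypothetical.

```python
from collections import Counter
from math import sqrt


def gapless_patterns(sequence: str, k: int = 3) -> Counter:
    """Count every contiguous (gap-free) length-k fragment of a sequence.

    Assumption: plain k-mers stand in for the paper's designed patterns.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pattern_similarity(seq_a: str, seq_b: str, k: int = 3) -> float:
    """Cosine similarity between the pattern-count vectors of two sequences.

    Assumption: cosine similarity is borrowed from information retrieval;
    it is not necessarily the measure used in the paper.
    """
    a = gapless_patterns(seq_a, k)
    b = gapless_patterns(seq_b, k)
    # Counter returns 0 for patterns that are absent from the other sequence.
    dot = sum(count * b[pattern] for pattern, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    norm = norm_a * norm_b
    # Sequences shorter than k yield no patterns; define similarity as 0.
    return dot / norm if norm else 0.0
```

Unlike edit distance, a measure of this kind never aligns residues one by one: two sequences score highly only when they share whole fragments, and a pattern is either present intact or not counted, which mirrors the decision not to consider gaps within patterns.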