Wu T D, Brutlag D L
Section on Medical Informatics, Stanford University, California 94305, USA.
Proc Int Conf Intell Syst Mol Biol. 1995;3:402-10.
Analyzing a set of protein sequences involves a fundamental relationship between the coherency of the set and the specificity of the motif that describes it. Motifs may be obscured by training sets that contain incoherent sequences, in part due to protein subclasses, contamination, or errors. We develop an algorithm for motif identification that systematically explores possible patterns of coherency within a set of protein sequences. Our algorithm constructs alternative partitions of the training set data, where one subset of each partition is presumed to contain coherent data and is used for forming a motif. The motif is represented by multiple overlapping amino acid groups based on evolutionary, biochemical, or physical properties. We demonstrate our method on a training set of reverse transcriptases that contains subclasses, sequence errors, misalignments, and contaminating sequences. Despite these complications, our program identifies a novel motif for the subclass of retroviral and retrovirus-related reverse transcriptases. This motif has a much higher specificity than previously reported motifs and suggests the importance of conserved hydrophilic and hydrophobic residues in the structure of reverse transcriptases.
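The two ideas in the abstract, describing each motif position by an amino acid group and trying alternative subsets of the training set, can be illustrated with a minimal Python sketch. This is not the authors' program. The group definitions in `GROUPS`, the scoring function `specificity_score`, and the exhaustive subset search in `best_by_size` are all assumptions chosen for illustration, and the sketch expects short, ungapped, pre-aligned sequences of equal length.

```python
import itertools
import math

# Hypothetical, illustrative amino acid groups (not the groups used in the paper).
# Groups overlap: for example, H belongs to "aromatic", "positive", and "polar".
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "aromatic": set("FWYH"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "polar": set("STNQYCH"),
    "small": set("AGSTCP"),
}
ALL_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def column_group(residues):
    """Return (name, members) of the smallest group covering one alignment column."""
    observed = set(residues)
    if len(observed) == 1:
        # A fully conserved column is described by the residue itself.
        return next(iter(observed)), observed
    covering = [(len(m), name) for name, m in GROUPS.items() if observed <= m]
    if not covering:
        # No group covers the column, so the position is unconstrained.
        return "any", ALL_RESIDUES
    _, name = min(covering)
    return name, GROUPS[name]


def build_motif(aligned):
    """Build a motif (one group per column) from equal-length aligned sequences."""
    return [column_group(column) for column in zip(*aligned)]


def specificity_score(motif):
    """Assumed score: sum of log2(20 / group size); higher means more specific."""
    return sum(math.log2(20 / len(members)) for _, members in motif)


def matches(motif, sequence):
    """True if every residue falls inside the group at its position."""
    return len(sequence) == len(motif) and all(
        residue in members for residue, (_, members) in zip(sequence, motif)
    )


def best_by_size(aligned, min_size):
    """For each subset size, keep the subset whose motif is most specific.

    Each subset plays the role of the presumed-coherent half of a partition.
    The search is exhaustive, so it is only practical for very small sets.
    """
    best = {}
    for size in range(min_size, len(aligned) + 1):
        for subset in itertools.combinations(aligned, size):
            motif = build_motif(subset)
            score = specificity_score(motif)
            if size not in best or score > best[size][0]:
                best[size] = (score, subset, motif)
    return best
```

With toy input such as `["YMDD", "YVDD", "YLDD", "GKRW"]`, the best three-sequence subset excludes `GKRW`, while the motif built from all four sequences collapses to unconstrained positions. That is the trade-off between the coherency of the set and the specificity of the motif in miniature.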