Patthy L
Institute of Enzymology, Hungarian Academy of Sciences, Budapest.
J Mol Biol. 1987 Dec 20;198(4):567-77. doi: 10.1016/0022-2836(87)90200-2.
A simple protocol is described that is suitable for the detection of distantly related members of a protein family. In this procedure, similarity to a consensus sequence is used to distinguish chance similarity from similarity due to common ancestry. The consensus sequence is constructed from the sequences of established members of a protein family, and it incorporates features characteristic of the protein fold of this family: conserved residues, the pattern of variable and conserved segments, the preferred location of gaps, etc. The database is searched with the consensus sequence, using the unitary matrix or a log-odds matrix for scoring the alignments, with a variable gap penalty. The advantages of the method are that it weights key residues, ignores sequence similarity in variable segments (thus partially eliminating "background noise" coming from chance similarity), and distinguishes gaps that disrupt conserved segments from those occurring in positions known to be tolerant of gap events. The utility of the method was demonstrated for the protein family homologous with the internal repeats of complement B, as well as for the internal repeats identified in fibroblast proteoglycan PG40. The consensus sequence method succeeded in finding some new members of these protein families that could not be detected by earlier methods of sequence comparison.
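The scoring idea described above can be illustrated with a minimal sketch. It is not the author's implementation: the names `ConsensusPosition` and `consensus_score`, the use of "x" for variable positions, the linear per-residue gap cost, and all numeric weights and penalties are assumptions chosen for illustration. The sketch performs a global alignment of a candidate sequence against a consensus, scoring identities at conserved positions by a position-specific weight (a weighted form of unitary scoring), scoring variable positions as zero, and charging a position-specific gap penalty.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConsensusPosition:
    residue: str        # consensus residue; "x" marks a variable position (assumed convention)
    weight: float       # score awarded for an identity at this position
    gap_penalty: float  # cost per gapped residue at, or just after, this position


def consensus_score(consensus: Sequence[ConsensusPosition], sequence: str) -> float:
    """Global alignment score of `sequence` against `consensus`."""
    if not consensus:
        raise ValueError("consensus must contain at least one position")
    n, m = len(consensus), len(sequence)
    # dp[i][j] = best score aligning consensus[:i] with sequence[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Residues preceding the first consensus position are charged its penalty.
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - consensus[0].gap_penalty
    for i in range(1, n + 1):
        pos = consensus[i - 1]
        dp[i][0] = dp[i - 1][0] - pos.gap_penalty
        for j in range(1, m + 1):
            # Variable positions never contribute, so chance identities there are ignored.
            identity = pos.residue != "x" and pos.residue == sequence[j - 1]
            aligned = dp[i - 1][j - 1] + (pos.weight if identity else 0.0)
            # Consensus position i has no partner in the sequence.
            skip_consensus = dp[i - 1][j] - pos.gap_penalty
            # Extra sequence residue inserted just after consensus position i.
            extra_residue = dp[i][j - 1] - pos.gap_penalty
            dp[i][j] = max(aligned, skip_consensus, extra_residue)
    return dp[n][m]


# Hypothetical four-position motif: two key cysteines flanking a gap-tolerant variable segment.
motif = [
    ConsensusPosition("C", 2.0, 3.0),
    ConsensusPosition("x", 0.0, 0.5),
    ConsensusPosition("x", 0.0, 0.5),
    ConsensusPosition("C", 2.0, 3.0),
]
print(consensus_score(motif, "CAAC"))  # 4.0: both key residues matched, no gaps
print(consensus_score(motif, "CAC"))   # 3.5: one cheap gap in the variable segment
```

In this toy motif a gap inside the variable segment costs 0.5, whereas losing a conserved cysteine would cost 3.0 plus the forfeited match weight, which mirrors the distinction the abstract draws between gaps in tolerant positions and gaps that disrupt conserved segments.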