Standley Daron M, Toh Hiroyuki, Nakamura Haruki
Research Center for Structural and Functional Proteomics, Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan.
Proteins. 2008 Sep;72(4):1333-51. doi: 10.1002/prot.22015.
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 aligned to templates that either belonged to the same Pfam family as the query or shared ≥ 30% sequence identity with it. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized.
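The two quantitative ideas in the abstract, weighting structurally aligned residue pairs by a conservation-derived weight and evaluating the score by the area under the ROC curve, can be illustrated with a minimal sketch. This is not the authors' scoring function: the abstract does not give its functional form, so `weighted_alignment_score`, its inputs, and the example numbers below are hypothetical placeholders. The `roc_auc` function uses the standard rank-based definition of the area under the ROC curve (the probability that a randomly chosen same-family pair outscores a randomly chosen different-family pair).

```python
from typing import Sequence


def weighted_alignment_score(
    pair_similarity: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Sum per-pair structural similarity terms, each scaled by a weight.

    Hypothetical form: `pair_similarity[i]` stands in for a structural
    similarity term of the i-th aligned residue pair, and `weights[i]`
    for a conservation weight derived from a position-specific scoring
    matrix. The paper's actual functional form is not given in the abstract.
    """
    if len(pair_similarity) != len(weights):
        raise ValueError("one weight is required per aligned residue pair")
    return sum(s * w for s, w in zip(pair_similarity, weights))


def roc_auc(
    positive_scores: Sequence[float],
    negative_scores: Sequence[float],
) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as one half.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores for illustration only (not data from the paper).
same_family = [0.9, 0.8, 0.4]       # pairs sharing a Pfam family
different_family = [0.5, 0.3]       # same superfamily, different family
auc = roc_auc(same_family, different_family)
```

The pairwise form is quadratic in the number of domain pairs, which is adequate for a small illustration; a sort-based rank computation would be the usual choice for a full benchmark set.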