Vacic Vladimir, Iakoucheva Lilia M, Lonardi Stefano, Radivojac Predrag
Department of Computer Science and Engineering, University of California, Riverside, California, USA.
J Comput Biol. 2010 Jan;17(1):55-72. doi: 10.1089/cmb.2009.0029.
We introduce a novel graph-based kernel method for annotating functional residues in protein structures. A structure is first modeled as a protein contact graph, where nodes correspond to residues and edges connect spatially neighboring residues. Each vertex in the graph is then represented as a vector of counts of labeled non-isomorphic subgraphs (graphlets) centered on the vertex of interest. A similarity measure between two vertices is expressed as the inner product of their respective count vectors and is used in a supervised learning framework to classify protein residues. We evaluated our method on two function prediction problems: identification of catalytic residues in proteins, which is a well-studied problem suitable for benchmarking, and the much less explored problem of predicting phosphorylation sites in protein structures. The performance of the graphlet kernel approach was then compared against two alternative methods, a sequence-based predictor and our implementation of the FEATURE framework. On both tasks, the graphlet kernel performed favorably; however, the margin was considerably larger on the problem of phosphorylation site prediction. Although there is evidence that phosphorylation sites are preferentially positioned in intrinsically disordered regions, we show that, for the sites located in structured regions, neither surface accessibility alone nor the averaged measures calculated from the residue microenvironments utilized by FEATURE were sufficient to achieve high accuracy. The key benefit of the graphlet representation is its ability to capture neighborhood similarities in protein structures by enumerating the patterns of local connectivity in the corresponding labeled graphs.
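The pipeline described in the abstract (contact graph, per-vertex graphlet counts, inner-product kernel) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: one coordinate per residue, a 6.0 Å contact cutoff, graphlets limited to at most three nodes, and the function names `contact_graph`, `graphlet_counts`, and `graphlet_kernel`.

```python
from collections import Counter
from itertools import combinations
import math


def contact_graph(coords, cutoff=6.0):
    """Link residues whose representative points lie within `cutoff` (assumed value)."""
    adj = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def graphlet_counts(adj, labels, root):
    """Count labeled graphlets with up to 3 nodes that contain `root`.

    Keys record the graphlet type, the root's label, and the other labels.
    Labels are sorted where the two non-root positions are interchangeable.
    """
    counts = Counter()
    r = labels[root]
    counts[("node", r)] += 1
    for a in adj[root]:
        counts[("edge", r, labels[a])] += 1
    # Both other nodes touch the root: a triangle, or a path with the root in the middle.
    for a, b in combinations(sorted(adj[root]), 2):
        kind = "triangle" if b in adj[a] else "path_mid"
        counts[(kind, r) + tuple(sorted((labels[a], labels[b])))] += 1
    # Path root - a - b, where b is not itself a neighbor of the root.
    for a in adj[root]:
        for b in adj[a]:
            if b != root and b not in adj[root]:
                counts[("path_end", r, labels[a], labels[b])] += 1
    return counts


def graphlet_kernel(c1, c2):
    """Inner product of two sparse graphlet count vectors."""
    return sum(v * c2[k] for k, v in c1.items())


# Toy chain of four residues spaced 4 Å apart, so only consecutive residues are in contact.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
labels = ["S", "G", "S", "K"]  # one-letter residue types used as vertex labels
adj = contact_graph(coords)
c0 = graphlet_counts(adj, labels, 0)
c2 = graphlet_counts(adj, labels, 2)
print(graphlet_kernel(c0, c2))  # prints 3
```

The two serine residues share three graphlets (the single node, the S-G edge, and the S-G-S path), so their kernel value is 3. A matrix of such values over all residue pairs could then be supplied to any classifier that accepts a precomputed kernel.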