Håndstad Tony, Hestnes Arne J H, Saetrom Pål
Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7052, Trondheim, Norway.
BMC Bioinformatics. 2007 Jan 25;8:23. doi: 10.1186/1471-2105-8-23.
Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences, and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences.
We introduce the GPkernel, a motif kernel based on discrete sequence motifs that are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results than related methods of remote homology detection.
Compared to the other methods, the GPkernel gives particularly good results on the more difficult fold recognition problem. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs gives a better description of the similarities and differences between folds than previous motif-based methods do.
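The idea of basing a kernel on shared occurrences of discrete sequence motifs can be illustrated with a minimal sketch. It is not the GPkernel itself: the motifs below are hypothetical placeholders written as regular expressions rather than motifs evolved by genetic programming, and the binary occurrence features, the function names, and the cosine-style normalization are all assumptions made for illustration.

```python
import math
import re

# Hypothetical motifs in regular-expression syntax. These are illustrative
# placeholders, not motifs produced by the GPkernel.
MOTIFS = [r"C..C", r"G[KR]S", r"H.{2,4}H"]


def motif_features(sequence, motifs=MOTIFS):
    """Return a 0/1 vector marking which motifs occur in the sequence."""
    return [1.0 if re.search(m, sequence) else 0.0 for m in motifs]


def motif_kernel(seq_a, seq_b, motifs=MOTIFS):
    """Count the motifs that occur in both sequences (a linear kernel)."""
    fa = motif_features(seq_a, motifs)
    fb = motif_features(seq_b, motifs)
    return sum(a * b for a, b in zip(fa, fb))


def normalized_motif_kernel(seq_a, seq_b, motifs=MOTIFS):
    """Scale the kernel to [0, 1]; return 0.0 if either sequence has no match."""
    k_ab = motif_kernel(seq_a, seq_b, motifs)
    k_aa = motif_kernel(seq_a, seq_a, motifs)
    k_bb = motif_kernel(seq_b, seq_b, motifs)
    if k_aa == 0.0 or k_bb == 0.0:
        return 0.0
    return k_ab / math.sqrt(k_aa * k_bb)


if __name__ == "__main__":
    # Two made-up sequences that share exactly one of the three motifs.
    print(normalized_motif_kernel("ACDECGKS", "MGRSHAAH"))
```

A matrix of such kernel values over all training sequences could then be passed to a support vector machine that accepts a precomputed kernel. A sequence that matches none of the motifs gets a zero feature vector, which mirrors the failure case the abstract describes: without a suitable motif set, the kernel carries no information about that sequence.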