Hou Yuna, Hsu Wynne, Lee Mong Li, Bystroff Christopher
School of Computing, National University of Singapore, Singapore 117543.
Bioinformatics. 2003 Nov 22;19(17):2294-301. doi: 10.1093/bioinformatics/btg317.
The function of an unknown biological sequence can often be accurately inferred if we can map the sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machines (SVMs) with sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode only sequence information into their feature vectors and ignore structural information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structural similarity for remote homology detection.
We ran experiments on the Structural Classification of Proteins (SCOP) 1.53 data set. The results show that SVM-I-sites is more efficient than SVM-pairwise. Further, we find that SVM-I-sites outperforms sequence-based methods such as PSI-BLAST, SAM, and SVM-Fisher while achieving performance comparable to that of SVM-pairwise.
The I-sites server is accessible through the web at http://www.bioinfo.rpi.edu. Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework for encoding local structure into feature vectors is available upon request.
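The abstract does not spell out how local structure is encoded into a feature vector, so the following is only a minimal sketch of one plausible scheme, not the authors' implementation. It assumes each sequence comes with a list of local structure motif predictions, each a (motif id, position, confidence) triple, and keeps the highest confidence per motif. The motif names, the `encode_local_structure` function, and the max-confidence rule are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical motif library: placeholder names, not real I-sites cluster IDs.
MOTIF_IDS: List[str] = ["motif_a", "motif_b", "motif_c", "motif_d"]

# One prediction = (motif id, start position in the sequence, confidence in [0, 1]).
Prediction = Tuple[str, int, float]


def encode_local_structure(
    predictions: List[Prediction],
    motif_ids: List[str] = MOTIF_IDS,
) -> List[float]:
    """Return one feature per motif: the highest confidence seen in the sequence.

    Assumption: max-pooling over positions; the paper may use a different rule.
    """
    best: Dict[str, float] = {m: 0.0 for m in motif_ids}
    for motif, _position, confidence in predictions:
        # Predictions for motifs outside the library are ignored.
        if motif in best and confidence > best[motif]:
            best[motif] = confidence
    return [best[m] for m in motif_ids]


# Made-up predictions for a single sequence, for illustration only.
example: List[Prediction] = [
    ("motif_a", 3, 0.42),
    ("motif_c", 10, 0.80),
    ("motif_a", 27, 0.65),
]
vector = encode_local_structure(example)
print(vector)  # [0.65, 0.0, 0.8, 0.0]
```

In this sketch the vector length equals the size of the motif library, so it stays fixed as the training set grows. The resulting vectors could then be passed to any standard SVM trainer.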