Zaki Nazar M, Deris Safaai, Illias Rosli
College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates.
Appl Bioinformatics. 2005;4(1):45-52. doi: 10.2165/00822942-200504010-00005.
Biological information is now being produced far faster than it can be consumed. The key issue is how to organise and manage this large volume of novel information so that it remains accessible. One core problem in classifying biological information is the annotation of new protein sequences with structural and functional features.
This article introduces the application of string kernels to the classification of protein sequences into homogeneous families. A string kernel approach used in conjunction with support vector machines (SVMs) has been shown to achieve good performance in text categorisation tasks. We evaluated and analysed the performance of this approach, and we present experimental results on three selected families from the SCOP (Structural Classification of Proteins) database. We then compared the overall performance of this method with existing protein classification methods on benchmark SCOP datasets.
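To make the idea of a string kernel concrete, the following is a minimal Python sketch of a k-mer spectrum kernel, one of the simplest members of the string kernel family. It is an illustration only: the abstract does not specify which string kernel the authors used, and the function names, the choice of k = 3 and the cosine normalisation are all assumptions introduced here.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    Illustrative stand-in for a string kernel; not necessarily the
    kernel used in the article.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Iterate over the smaller table; a Counter returns 0 for absent keys.
    if len(ca) > len(cb):
        ca, cb = cb, ca
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalised_kernel(a, b, k=3):
    """Cosine-normalised kernel, so sequence length does not dominate."""
    kaa = spectrum_kernel(a, a, k)
    kbb = spectrum_kernel(b, b, k)
    if kaa == 0 or kbb == 0:
        # One of the sequences is shorter than k and has no k-mers.
        return 0.0
    return spectrum_kernel(a, b, k) / sqrt(kaa * kbb)


def gram_matrix(seqs, k=3):
    """Pairwise kernel values for a list of sequences."""
    return [[normalised_kernel(a, b, k) for b in seqs] for a in seqs]
```

A Gram matrix built this way could be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. No alignment or other biological prior enters the computation, which matches the abstract's point that the approach uses no prior biological knowledge.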
According to the F1 performance measure and the rate of false positives (RFP) measure, the string kernel method performs well in classifying protein sequences. The method outperformed all the generative methods and is comparable with the SVM-Fisher method.
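As a rough illustration of the two measures named above, the sketch below computes F1 and a rate of false positives from binary predictions. The F1 formula is the standard harmonic mean of precision and recall. The RFP formula used here, FP / (FP + TN), is an assumption: the abstract does not define the measure, and some protein classification benchmarks instead define it relative to the score of a given positive test sequence. All function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means family member."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def rate_of_false_positives(fp, tn):
    """Assumed definition: fraction of true negatives predicted positive."""
    total_negatives = fp + tn
    return fp / total_negatives if total_negatives else 0.0
```

For example, with three family members and five non-members, of which two members and one non-member are predicted positive, `confusion_counts` gives (2, 1, 4, 1), the F1 score is 4/6 and the assumed RFP is 1/5.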
Although the string kernel approach makes no use of prior biological knowledge, it still captures sufficient biological information to enable it to outperform some of the state-of-the-art methods.