Application of string kernels in protein sequence classification.

Author information

Zaki Nazar M, Deris Safaai, Illias Rosli

Affiliation

College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates.

Publication information

Appl Bioinformatics. 2005;4(1):45-52. doi: 10.2165/00822942-200504010-00005.

DOI:10.2165/00822942-200504010-00005

PMID:16000012

Abstract

INTRODUCTION

The production of biological information has become much greater than its consumption. The key issue now is how to organise and manage the huge amount of novel information to facilitate access to this useful and important biological information. One core problem in classifying biological information is the annotation of new protein sequences with structural and functional features.

METHOD

This article introduces the application of string kernels in classifying protein sequences into homogeneous families. A string kernel approach used in conjunction with support vector machines has been shown to achieve good performance in text categorisation tasks. We evaluated and analysed the performance of this approach, and we present experimental results on three selected families from the SCOP (Structural Classification of Proteins) database. We then compared the overall performance of this method with the existing protein classification methods on benchmark SCOP datasets.
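As a rough illustration of what a string kernel computes, the sketch below implements a k-spectrum kernel, which scores two sequences by the k-mers they share. This is an assumption chosen for brevity: the abstract does not say which string kernel the authors used, and the function names and the default `k=3` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of a and b.

    Illustrative k-spectrum kernel; not necessarily the kernel
    used in the article.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalised kernel, so that K(x, x) == 1 for non-trivial x."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


def gram_matrix(seqs, k=3):
    """Pairwise kernel matrix over a list of sequences."""
    return [[normalized_kernel(s, t, k) for t in seqs] for s in seqs]
```

A Gram matrix built this way could, for example, be passed to a support vector machine that accepts precomputed kernels, such as scikit-learn's `SVC(kernel="precomputed")`. No biological knowledge enters the computation; only raw substring overlap does.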

RESULTS

According to the F1 performance measure and the rate of false positives (RFP) measure, the string kernel method performs well in classifying protein sequences. The method outperformed all the generative-based methods and is comparable with the SVM-Fisher method.
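For readers unfamiliar with these measures, the sketch below computes F1 from binary labels using its standard definition (the harmonic mean of precision and recall). The abstract does not define RFP, so the `false_positive_rate` function here assumes the simplest reading, FP / (FP + TN); the article may define the measure differently. All function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 marks family members."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp, tn):
    """ASSUMED reading of RFP: fraction of non-members predicted as members."""
    return fp / (fp + tn) if fp + tn else 0.0
```

A high F1 together with a low false positive rate would indicate that the classifier recovers most family members while rarely admitting non-members.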

DISCUSSION

Although the string kernel approach makes no use of prior biological knowledge, it still captures sufficient biological information to enable it to outperform some of the state-of-the-art methods.
