Yang Zheng Rong
Department of Computer Science, University of Exeter, Exeter, UK.
Brief Bioinform. 2004 Dec;5(4):328-38. doi: 10.1093/bib/5.4.328.
One of the major tasks in bioinformatics is the classification and prediction of biological data. With the rapid growth in the size of biological databanks, it is essential to use computer programs to automate the classification process. At present, the programs that give the best prediction performance are support vector machines (SVMs). This is because SVMs are designed to maximise the margin separating two classes, so that the trained model generalises well to unseen data. Most other classifiers are built by minimising the error incurred in training, which leads to poorer generalisation. Because of this, SVMs have been widely applied in many areas of bioinformatics, including protein function prediction, protease functional site recognition, transcription initiation site prediction and gene expression data classification. This paper discusses the principles of SVMs and their application to the analysis of biological data, mainly protein and DNA sequences.
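The margin-maximisation idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a linear soft-margin SVM trained by plain subgradient descent on the regularised hinge loss, and the data, the function names (`train_linear_svm`, `predict`) and the hyperparameters (`lam`, `lr`, `epochs`) are all hypothetical choices made for illustration. Real applications to protein or DNA sequences would first need a numerical encoding of the sequences and would normally use an established SVM library.

```python
# Toy, hypothetical 2-D data: two linearly separable classes labelled +1 / -1.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
     (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    The ||w||^2 term favours a wide margin; the hinge term penalises
    points that fall inside the margin or on the wrong side of it.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The regularisation weight `lam` sets the trade-off between a wide margin and few margin violations; a classifier that only minimises training error corresponds to dropping the `||w||^2` term.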