Rögnvaldsson Thorsteinn, You Liwen
Intelligent Systems Laboratory, School of Information Science, Computer and Electrical Engineering, Halmstad University, Box 823, 301 18 Halmstad, Sweden.
Bioinformatics. 2004 Jul 22;20(11):1702-9. doi: 10.1093/bioinformatics/bth144. Epub 2004 Feb 26.
Several published papers have used nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, to model the specificity of the HIV-1 protease and to extract specificity rules. We show that the dataset used in these studies is linearly separable, so applying nonlinear classifiers to this problem is a misuse of them. The best solution on this dataset is achieved with a linear classifier such as the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease.
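For reference, linear separability can be stated as follows. The notation is an assumption introduced here, not taken from the abstract: $\mathbf{x}_i$ is the encoded peptide, $y_i \in \{-1, +1\}$ marks it as non-cleaved or cleaved, and $N$ is the number of peptides.

```latex
\exists\, \mathbf{w} \in \mathbb{R}^{d},\ b \in \mathbb{R}
\quad \text{such that} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) > 0
\quad \text{for all } i = 1, \dots, N
```

When such a pair $(\mathbf{w}, b)$ exists, a linear classifier can fit the training set without error.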
Understanding HIV-1 protease specificity is important when designing HIV inhibitors, and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because the models used have been nonlinear and overly complex.
We show that the problem is much easier than previously reported and that linear classifiers such as the simple perceptron or linear support vector machines predict at least as well as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers.
The datasets used are available at http://www.hh.se/staff/bioinf/
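The idea of testing for linear separability with a simple perceptron and then reading rules off its weights can be sketched as below. This is a minimal illustration, not the authors' code: it assumes 8-residue peptides labelled P4 to P4', a one-hot (orthogonal) encoding, and a synthetic toy dataset generated by a made-up rule. The names `encode`, `train_perceptron` and `top_rules` are hypothetical helpers.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Assumed 8-residue window, labelled in Schechter-Berger notation.
POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def encode(peptide):
    """One-hot encode a peptide: 8 positions x 20 residues = 160 inputs."""
    vec = [0.0] * (len(POSITIONS) * len(AMINO_ACIDS))
    for pos, residue in enumerate(peptide):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    return vec


def predict(w, b, x):
    """Return +1 (cleaved) or -1 (not cleaved) for an encoded peptide."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


def train_perceptron(X, y, max_epochs=1000):
    """Simple perceptron; stops after the first epoch with no mistakes."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, y):
            if predict(w, b, x) != t:
                # Classic perceptron update: move the hyperplane toward x.
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:
            return w, b, True   # a separating hyperplane was found
    return w, b, False          # inconclusive within max_epochs


def top_rules(w, k=5):
    """List the k largest weights as (position, residue, weight) tuples."""
    n = len(AMINO_ACIDS)
    ranked = sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:k]
    return [(POSITIONS[i // n], AMINO_ACIDS[i % n], w[i]) for i in ranked]


# Synthetic toy data, NOT the HIV-1 protease dataset: a peptide is labelled
# "cleaved" when its P1 residue is one of F, L, M, Y (a made-up rule).
rng = random.Random(0)
peptides = ["".join(rng.choice(AMINO_ACIDS) for _ in POSITIONS)
            for _ in range(200)]
X = [encode(p) for p in peptides]
y = [1 if p[3] in "FLMY" else -1 for p in peptides]

w, b, converged = train_perceptron(X, y)
print("separating hyperplane found:", converged)
for position, residue, weight in top_rules(w):
    print(position, residue, weight)
```

If the perceptron completes an epoch without mistakes, the training set is linearly separable. The converse does not hold in this sketch: failing to converge within `max_epochs` is inconclusive, not proof of non-separability. Large positive weights point to residue and position pairs that favour the positive class, which is the sense in which rules can be read from a linear model.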