Dong Qi-Wen, Wang Xiao-Long, Lin Lei
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Bioinformatics. 2006 Feb 1;22(3):285-90. doi: 10.1093/bioinformatics/bti801. Epub 2005 Nov 29.
Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vectors or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the number of features is usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.
In this study, latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing, is applied to protein remote homology detection. Several basic building blocks of protein sequences are investigated as the 'words' of a 'protein sequence language', including N-grams, patterns and motifs. Each protein sequence is treated as a 'document' represented as a bag of words. A word-document matrix is constructed first, and LSA is then performed on it to produce latent semantic representation vectors of the protein sequences, which removes noise and yields a more compact description of each sequence. The latent semantic representation vectors are then evaluated with an SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection compared with the basic formalisms, that is, the same building blocks used without LSA. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
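The pipeline described above (words, word-document matrix, LSA, reduced vectors) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the choice of 2-grams as words, the raw-count weighting, the rank of 2, the toy sequences, and the function names `ngram_matrix`, `lsa_fit` and `lsa_fold_in` are all assumptions made for illustration. LSA is realised here as a truncated singular value decomposition computed with NumPy.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_matrix(sequences, n=2):
    """Word-document count matrix: one row per N-gram, one column per sequence."""
    vocab = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}
    matrix = np.zeros((len(vocab), len(sequences)))
    for col, seq in enumerate(sequences):
        for start in range(len(seq) - n + 1):
            row = vocab.get(seq[start:start + n])
            if row is not None:  # skip N-grams containing non-standard residues
                matrix[row, col] += 1
    return matrix


def lsa_fit(matrix, rank):
    """Truncated SVD of the word-document matrix (A ~ U_k S_k V_k^T)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k = u[:, :rank], s[:rank]
    doc_vectors = vt[:rank].T  # one latent vector per training sequence
    return u_k, s_k, doc_vectors


def lsa_fold_in(u_k, s_k, column):
    """Map a new word-count column into the latent space: d^T U_k S_k^{-1}."""
    return (column @ u_k) / s_k


# Hypothetical toy sequences, not taken from SCOP.
train = ["MKVLAAGIVGLLLAQ", "MKVLAAGIVALLLAQ", "GSSPEEDRRKKTTWY", "GSSPEEDRKKKTTWY"]
matrix = ngram_matrix(train, n=2)
u_k, s_k, doc_vectors = lsa_fit(matrix, rank=2)

# An unseen sequence is counted with the same vocabulary, then folded in.
query = ngram_matrix(["MKVLAAGIVGLLKAQ"], n=2)[:, 0]
query_vector = lsa_fold_in(u_k, s_k, query)
print(doc_vectors.shape, query_vector.shape)
```

The rows of `doc_vectors` play the role of the latent semantic representation vectors; in a full experiment they would be passed to an SVM trainer in place of the raw 400-dimensional 2-gram counts. Folding in unseen sequences with the training decomposition, rather than recomputing the SVD, is one common convention and is assumed here.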