Liu Bin, Chen Junjie, Wang Xiaolong
School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, Guangdong, People's Republic of China.
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, Guangdong, People's Republic of China.
Mol Genet Genomics. 2015 Oct;290(5):1919-31. doi: 10.1007/s00438-015-1044-4. Epub 2015 Apr 21.
Protein remote homology detection is a key task in computational proteomics, with value for both basic research and practical applications. SVM-based discriminative methods currently show superior performance. However, existing feature vectors still do not represent protein sequences adequately, and the resulting models are often hard to interpret in terms of characteristic features. Previous studies have shown that sequence-order effects and physicochemical properties are important for representing protein sequences, but how to use this information to construct predictors remains a challenging problem. To incorporate sequence-order information and physicochemical properties into the prediction, this study proposes a method called disPseAAC, in which the feature vector is constructed by combining the occurrences of amino acid pairs within Chou's pseudo amino acid composition (PseAAC) approach. Applying principal component analysis further improves predictive performance and reduces computational cost. Experiments on a benchmark dataset show that disPseAAC achieves an ROC score of 0.922, outperforming some existing state-of-the-art methods. Furthermore, the learned model can easily be analyzed in terms of its discriminative features, and the computational cost of the proposed method is much lower than that of other profile-based methods.
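To make the idea of counting amino acid pairs concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one plausible layout: 20 composition frequencies followed by 400 ordered-pair frequencies for each gap distance. The function name `distance_pair_features`, the parameter `max_distance`, and the normalization are all hypothetical choices made for illustration, and the physicochemical-property terms of PseAAC are not modeled here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def distance_pair_features(sequence, max_distance=2):
    """Return residue frequencies plus ordered-pair frequencies per gap distance.

    Hypothetical illustration only: the vector layout (20 entries, then 400
    entries for each distance 1..max_distance) is an assumption, not the
    exact definition used by disPseAAC.
    """
    seq = sequence.upper()

    # Distance 0: plain amino acid composition over the standard residues.
    valid = [aa for aa in seq if aa in AMINO_ACIDS]
    total = len(valid) or 1
    features = [valid.count(aa) / total for aa in AMINO_ACIDS]

    # Distance d >= 1: frequency of each ordered pair (a, b) where b occurs
    # exactly d positions after a. Pairs with a non-standard residue are skipped.
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    for d in range(1, max_distance + 1):
        counts = dict.fromkeys(pairs, 0)
        n_pairs = 0
        for a, b in zip(seq, seq[d:]):
            if a + b in counts:
                counts[a + b] += 1
                n_pairs += 1
        n_pairs = n_pairs or 1  # avoid division by zero on short sequences
        features.extend(counts[p] / n_pairs for p in pairs)
    return features


if __name__ == "__main__":
    vec = distance_pair_features("MKVLAAGIVALLLAAGCSSA", max_distance=2)
    print(len(vec))  # 20 + 400 * 2 entries under the assumed layout
```

Under these assumptions, each protein maps to a fixed-length vector regardless of its length. As the abstract describes, such vectors would then be reduced with principal component analysis and passed to an SVM.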