Gupta Ravi, Mittal Ankush, Singh Kuldip
Department of Electronics and Computer Engineering, Indian Institute of Technology-Roorkee, Roorkee 247667, India.
IEEE Trans Inf Technol Biomed. 2008 Jul;12(4):541-8. doi: 10.1109/TITB.2007.911308.
G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as the regulation of growth, death, and metabolism of cells. GPCRs are the focus of a significant amount of current pharmaceutical research, since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify GPCRs. However, this approach has two major disadvantages. First, the dimension of the dipeptide-based feature vector is 400, which makes the classification task inefficient in both computation and memory. Second, it does not consider the biological properties of the protein sequence when identifying and classifying GPCRs. In this paper, we present a novel-feature-based SVM classification technique. The novel features are derived by applying a wavelet-based time series analysis approach to protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of the amino acids in a protein sequence. In addition, the dimension of the feature vector for the proposed technique is only 35. Experiments were performed on GPCR protein sequences available in the GPCRs Database. Our approach achieves accuracies of 99.9%, 98.06%, 97.78%, and 94.08% for the GPCR superfamily, families, subfamilies, and subsubfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, accuracies of 99.8%, 97.26%, and 97.84% were obtained on unseen (recall) datasets of the GPCR superfamily, families, and subfamilies, respectively. Comparison with the dipeptide-based SVM technique shows the effectiveness of our approach.
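To make the two feature spaces concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the 400-dimensional dipeptide composition (20 × 20 amino acid pairs) and a wavelet-variance feature vector. Several points are assumptions made for illustration: the abstract does not name the seven biological properties, the wavelet family, or the number of scales, so the sketch uses a Haar wavelet, five decomposition levels (so that 7 properties × 5 levels would give 35 features), and the Kyte-Doolittle hydropathy scale as a single stand-in property. All function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered pairs of the 20 standard amino acids: 20 * 20 = 400 dipeptides.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        pair = a + b
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


# Kyte-Doolittle hydropathy scale, used here only as an example property
# (assumption: the abstract does not list the seven properties it uses).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def haar_detail_variances(signal, levels=5):
    """Mean squared Haar detail coefficient at each decomposition level.

    Assumption: Haar wavelet with `levels` scales; a trailing odd sample
    is dropped at each level. Levels that cannot be computed yield 0.0.
    """
    approx = list(signal)
    variances = []
    for _ in range(levels):
        if len(approx) < 2:
            variances.append(0.0)
            continue
        n_pairs = len(approx) // 2
        detail = [(approx[2 * i] - approx[2 * i + 1]) / 2 ** 0.5
                  for i in range(n_pairs)]
        approx = [(approx[2 * i] + approx[2 * i + 1]) / 2 ** 0.5
                  for i in range(n_pairs)]
        variances.append(sum(d * d for d in detail) / len(detail))
    return variances


def wavelet_variance_features(seq, property_scales, levels=5):
    """Concatenate per-level variances for each amino acid property scale."""
    features = []
    for scale in property_scales:
        # Treat the sequence as a numeric series, one value per residue.
        signal = [scale[aa] for aa in seq if aa in scale]
        features.extend(haar_detail_variances(signal, levels))
    return features
```

With seven property scales and five levels, `wavelet_variance_features` returns 35 values per sequence regardless of sequence length, compared with 400 for `dipeptide_composition`. Either vector could then be passed to an SVM classifier.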