BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S3. doi: 10.1186/1471-2105-15-S2-S3. Epub 2014 Jan 24.
Protein remote homology detection is a central problem in bioinformatics, important for both basic research and practical applications. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve state-of-the-art performance. Designing feature vectors that incorporate the position information of amino acids or other protein building blocks is a key step toward improving the performance of SVM-based methods.
We propose two new methods for protein remote homology detection, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method in which the feature vector representation of a protein is based on the distances between residue pairs. SVM-DT is a profile-based method that considers the distances between Top-n-gram pairs. A Top-n-gram can be viewed as a profile-based building block of proteins and is calculated from the frequency profiles. Both methods are position-dependent approaches that incorporate the sequence-order information of protein sequences. Experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. Both methods showed promising results and clearly outperformed position-independent methods. Furthermore, the proposed methods can provide useful insights for studying the features of protein families.
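As an illustration of a distance-based residue-pair representation, the following minimal sketch counts how often each ordered residue pair occurs at each separation d along the sequence and normalizes the counts. It is not the authors' implementation: the function name `distance_pair_features`, the parameter `max_distance`, the normalization by the total count, and the restriction to the 20 standard amino acids are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (assumption: other symbols are ignored).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distance_pair_features(sequence, max_distance=3):
    """Return (features, index) for a protein sequence.

    Hypothetical sketch: features[index[(a, b, d)]] is the normalized
    count of positions i where sequence[i] == a and sequence[i + d] == b.
    """
    # Assign one feature slot to every (residue, residue, distance) triple.
    index = {}
    for d in range(1, max_distance + 1):
        for a, b in product(AMINO_ACIDS, repeat=2):
            index[(a, b, d)] = len(index)

    features = [0.0] * len(index)

    # Count every residue pair separated by exactly d positions.
    for d in range(1, max_distance + 1):
        for i in range(len(sequence) - d):
            key = (sequence[i], sequence[i + d], d)
            if key in index:  # skip non-standard residues
                features[index[key]] += 1.0

    # Normalize so the vector sums to 1 (assumed normalization).
    total = sum(features)
    if total > 0:
        features = [value / total for value in features]
    return features, index
```

The resulting fixed-length vector (20 x 20 x `max_distance` entries) could then be passed to any SVM implementation. Because each entry corresponds to a named residue pair and distance, the weights of a trained linear model can be mapped back to specific pairs, which is the kind of explicit feature space the abstract refers to.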
The better performance of the proposed methods demonstrates that position-dependent approaches are effective for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp.