Wieser Daniela, Niranjan Mahesan
The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
In Silico Biol. 2009;9(3):89-103.
Distant evolutionary relationships between proteins with low sequence similarity are difficult to recognise by computational methods. Consequently, many sequences obtained from large-scale sequencing projects cannot be assigned to any known protein or family, even when they are evolutionarily related to one. To boost sensitivity, various sequence-based methods have been modified to make use of secondary structure, which is better conserved than sequence. Most of these methods are instance-based or generative. Here, we introduce a kernel-based remote homology detection method that combines sequence and secondary-structure similarity scores in a discriminative approach. We studied the ability of the method to predict superfamily membership as defined by the SCOP database. We show that a kernel method that combined sequence similarity scores with predicted secondary-structure similarity scores performed similarly to a classifier that used scores calculated from sequences and true secondary structures. It performed better than a sequence-only classifier and achieved a better mean performance than recently published results on the same dataset. We conclude that SVM classifiers trained to predict homology between distantly related proteins become more accurate if a joint sequence/secondary-structure similarity score approach is used.
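The abstract does not say how the two kinds of similarity score are combined or which kernel is used, so the following is only a minimal sketch of the general idea under stated assumptions. Each protein is represented by its vector of similarity scores against a reference set, the sequence and secondary-structure vectors are merged by a weighted sum, and an RBF kernel is computed on the merged vectors. The function names, the `weight` and `gamma` parameters, and the toy scores are all hypothetical and are not taken from the paper.

```python
import math


def joint_features(seq_scores, ss_scores, weight=0.5):
    """Merge two score vectors by a weighted sum (assumed combination scheme)."""
    if len(seq_scores) != len(ss_scores):
        raise ValueError("score vectors must have the same length")
    return [(1 - weight) * s + weight * t for s, t in zip(seq_scores, ss_scores)]


def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(features, gamma=1.0):
    """Kernel value for every pair of feature vectors."""
    return [[rbf_kernel(x, y, gamma) for y in features] for x in features]


# Made-up pairwise similarity scores for three proteins (illustration only).
# Row i holds the scores of protein i against proteins 0, 1 and 2.
seq_similarity = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
ss_similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

features = [joint_features(a, b) for a, b in zip(seq_similarity, ss_similarity)]
K = gram_matrix(features)
```

In this toy data, proteins 0 and 1 have low sequence similarity but high secondary-structure similarity, so the joint kernel rates them as closer to each other than either is to protein 2. A Gram matrix of this form could then be passed to an SVM implementation that accepts precomputed kernels, with one classifier trained per superfamily.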