Comparative evaluation of word composition distances for the recognition of SCOP relationships.

Author information

Vinga Susana, Gouveia-Oliveira Rodrigo, Almeida Jonas S

Affiliation

Biomathematics Group, ITQB, Universidade Nova de Lisboa, Rua da Quinta Grande, n. 6, 2780-156 Oeiras, Portugal.

Publication information

Bioinformatics. 2004 Jan 22;20(2):206-15. doi: 10.1093/bioinformatics/btg392.

DOI:10.1093/bioinformatics/btg392

PMID:14734312

Abstract

MOTIVATION

Alignment-free metrics were recently reviewed by the authors but have not until now been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed there. It also presents a new distance definition between protein sequences, the W-metric, which bridges alignment metrics, such as scores produced by the Smith-Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and information content.
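To make the idea of an L-tuple composition distance concrete, the following is a minimal Python sketch, not the authors' MATLAB code. It assumes overlapping words and relative frequencies, and it computes the plain Euclidean distance between the two frequency vectors; it does not implement the standardized variant or the W-metric. The function names `word_frequencies` and `euclidean_distance` are illustrative only.

```python
from collections import Counter
from math import sqrt


def word_frequencies(seq, L=2):
    """Relative frequencies of the overlapping L-tuples (words) in seq."""
    if len(seq) < L:
        raise ValueError("sequence is shorter than the word length L")
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(seq_a, seq_b, L=2):
    """Plain Euclidean distance between two L-tuple frequency vectors.

    Assumption: unweighted frequencies; words absent from a sequence
    contribute a frequency of zero.
    """
    fa = word_frequencies(seq_a, L)
    fb = word_frequencies(seq_b, L)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

Because only word counts are needed, the cost grows linearly with sequence length, which is consistent with the speed advantage over alignment that the abstract reports.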

RESULTS

The comparative study reported here used the SCOP/ASTRAL hierarchical protein structure database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic (ROC) curves. Although alignment methods gave very good classification accuracy at the family and superfamily levels, alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is lower, such as for recognition of fold or class relationships. This observation justifies their use to pre-filter homologous proteins, since word statistics are computed much faster than alignments.
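The evaluation by area under the ROC curve can be sketched as follows. This is a minimal Python illustration, not the authors' procedure: it uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen related pair has a smaller distance than a randomly chosen unrelated pair, with ties counted as one half. The function name `roc_auc` and the toy values in the usage note are assumptions made for illustration.

```python
def roc_auc(distances, related):
    """Area under the ROC curve for a dissimilarity measure.

    distances: one dissimilarity value per sequence pair.
    related:   True where the pair shares the relationship being tested
               (for example, the same SCOP fold), False otherwise.
    """
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated pair")
    # Count related/unrelated comparisons ranked correctly; ties score 0.5.
    wins = sum(
        1.0 if p < n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

With made-up values, `roc_auc([0.1, 0.2, 0.8, 0.9], [True, True, False, False])` gives 1.0 (perfect separation), while `roc_auc([0.1, 0.9, 0.2, 0.8], [True, True, False, False])` gives 0.5 (no better than chance). The all-pairs loop is quadratic and is meant only to show the definition; a rank-based computation would be preferable for a database-scale comparison.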

AVAILABILITY

All MATLAB code used to generate the data is available from the authors upon request. Additional material is available at http://bioinformatics.musc.edu/wmetric
