Comparative evaluation of word composition distances for the recognition of SCOP relationships.

Author information

Vinga Susana, Gouveia-Oliveira Rodrigo, Almeida Jonas S

Affiliation

Biomathematics Group, ITQB, Universidade Nova de Lisboa, Rua da Quinta Grande, n. 6, 2780-156 Oeiras, Portugal.

Publication information

Bioinformatics. 2004 Jan 22;20(2):206-15. doi: 10.1093/bioinformatics/btg392.

DOI: 10.1093/bioinformatics/btg392
PMID: 14734312
Abstract

MOTIVATION

Alignment-free metrics were recently reviewed by the authors but have not, until now, been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed there. It also presents a new distance definition between protein sequences, the W-metric, which bridges alignment metrics, such as scores produced by the Smith-Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and information content.
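To make "L-tuple composition" concrete, the following minimal sketch counts overlapping words of length L in two sequences and takes the Euclidean distance between the resulting vectors. It is an illustration only, not the authors' MATLAB code: the function names are hypothetical, and normalizing counts to relative frequencies is an assumption, since the abstract does not say whether raw counts or frequencies were used.

```python
from collections import Counter
import math


def word_frequencies(seq, L=2):
    """Relative frequencies of the overlapping L-tuples (words) in seq.

    Assumption: counts are normalized to frequencies; the abstract does
    not specify this detail.
    """
    if len(seq) < L:
        raise ValueError("sequence is shorter than the word length L")
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(seq_a, seq_b, L=2):
    """Euclidean distance between the L-tuple frequency vectors."""
    fa = word_frequencies(seq_a, L)
    fb = word_frequencies(seq_b, L)
    # Words absent from one sequence contribute a frequency of zero.
    words = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

Because no alignment is computed, the cost grows linearly with sequence length, which is the speed advantage that word statistics have over alignment methods.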

RESULTS

The comparative study reported here used the SCOP/ASTRAL hierarchical protein structure database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic (ROC) curves. Although alignment methods gave very good classification accuracy at the family and superfamily levels, alignment-free distances, in particular the Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is lower, as in the recognition of fold or class relationships. This observation justifies using alignment-free distances to pre-filter homologous proteins, since word statistics are computed much faster than alignments.
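The ROC-based evaluation can be sketched as follows. Given a dissimilarity value for each pair of sequences and a flag saying whether the pair shares a SCOP level (for example, the same fold), the area under the ROC curve equals the probability that a randomly chosen related pair has a smaller distance than a randomly chosen unrelated pair, with ties counted as one half. The function name and the input layout are hypothetical, and the abstract does not describe the exact protocol the authors followed.

```python
def roc_auc_from_distances(distances, related):
    """Area under the ROC curve for the rule 'small distance => related'.

    distances: one dissimilarity value per sequence pair.
    related:   one boolean per pair, True if the pair shares the SCOP
               level being tested (hypothetical input layout).
    """
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0   # related pair correctly ranked closer
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the distance separates related from unrelated pairs perfectly, while 0.5 means it does no better than chance. The pairwise loop is quadratic and is meant for clarity; a rank-based computation would suit larger benchmarks.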

AVAILABILITY

All MATLAB code used to generate the data is available from the authors upon request. Additional material is available at http://bioinformatics.musc.edu/wmetric

Similar articles

1. Comparative evaluation of word composition distances for the recognition of SCOP relationships.
   Bioinformatics. 2004 Jan 22;20(2):206-15. doi: 10.1093/bioinformatics/btg392.
2. ProClust: improved clustering of protein sequences with an extended graph-based approach.
   Bioinformatics. 2002;18 Suppl 2:S182-91. doi: 10.1093/bioinformatics/18.suppl_2.s182.
3. SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.
   BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.
4. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection.
   Bioinformatics. 2008 May 15;24(10):1264-70. doi: 10.1093/bioinformatics/btn112. Epub 2008 Mar 31.
5. Remote homology detection: a motif based approach.
   Bioinformatics. 2003;19 Suppl 1:i26-33. doi: 10.1093/bioinformatics/btg1002.
6. A comprehensive system for evaluation of remote sequence similarity detection.
   BMC Bioinformatics. 2007 Aug 28;8:314. doi: 10.1186/1471-2105-8-314.
7. Adaptive Smith-Waterman residue match seeding for protein structural alignment.
   Proteins. 2013 Oct;81(10):1823-39. doi: 10.1002/prot.24327. Epub 2013 Aug 19.
8. On the quality of tree-based protein classification.
   Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
9. Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures.
   PLoS Comput Biol. 2009 Mar;5(3):e1000331. doi: 10.1371/journal.pcbi.1000331. Epub 2009 Mar 27.
10. Towards an automatic classification of protein structural domains based on structural similarity.
    BMC Bioinformatics. 2008 Jan 31;9:74. doi: 10.1186/1471-2105-9-74.

Cited by

1. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences.
   Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: 10.1073/pnas.2401622121. Epub 2024 Oct 9.
2. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets.
   Med Biol Eng Comput. 2024 Aug;62(8):2449-2483. doi: 10.1007/s11517-024-03074-3. Epub 2024 Apr 16.
3. When Protein Structure Embedding Meets Large Language Models.
   Genes (Basel). 2023 Dec 23;15(1):25. doi: 10.3390/genes15010025.
4. Phylogenies from unaligned proteomes using sequence environments of amino acid residues.
   Sci Rep. 2022 May 6;12(1):7497. doi: 10.1038/s41598-022-11370-x.
5. Alignment-Free Sequence Analysis and Applications.
   Annu Rev Biomed Data Sci. 2018 Jul;1:93-114. doi: 10.1146/annurev-biodatasci-080917-013431. Epub 2018 Apr 25.
6. Benchmarking of alignment-free sequence comparison methods.
   Genome Biol. 2019 Jul 25;20(1):144. doi: 10.1186/s13059-019-1755-7.
7. Alignment-free sequence comparison: benefits, applications, and tools.
   Genome Biol. 2017 Oct 3;18(1):186. doi: 10.1186/s13059-017-1319-7.
8. Word decoding of protein amino Acid sequences with availability analysis: a linguistic approach.
   PLoS One. 2012;7(11):e50039. doi: 10.1371/journal.pone.0050039. Epub 2012 Nov 21.
9. Automatic structure classification of small proteins using random forest.
   BMC Bioinformatics. 2010 Jul 1;11:364. doi: 10.1186/1471-2105-11-364.
10. Pattern-based phylogenetic distance estimation and tree reconstruction.
    Evol Bioinform Online. 2007 Feb 25;2:359-75.