Application of compression-based distance measures to protein sequence classification: a methodological study.

Author information

Kocsor András, Kertész-Farkas Attila, Kaján László, Pongor Sándor

Affiliations

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary.

Publication information

Bioinformatics. 2006 Feb 15;22(4):407-12. doi: 10.1093/bioinformatics/bti806. Epub 2005 Nov 29.

DOI:10.1093/bioinformatics/bti806

PMID:16317070

Abstract

MOTIVATION

Distance measures built on the notion of text compression have been used to compare and classify entire genomes and mitochondrial genomes. This study explores their utility in the classification of protein sequences.

RESULTS

We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, and low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
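The abstract does not state the exact CBM formula, so the following is only a minimal sketch of the general idea. It assumes the widely used normalized compression distance and uses Python's `zlib` (DEFLATE, an LZ77-family compressor) as a stand-in for the Lempel-Ziv compressor of the study; PPMZ is not covered. The function names `compressed_size`, `ncd` and `nearest_neighbour_label` and the toy sequences are hypothetical and do not come from the paper or its datasets.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Assumed formula (not taken from the abstract):
        NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    where C(.) is the compressed size and xy is the concatenation.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_neighbour_label(query: str, labelled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbour classification: return the label of the closest sequence."""
    closest = min(labelled, key=lambda item: ncd(query, item[0]))
    return closest[1]


# Toy, made-up sequences over the amino acid alphabet (illustration only).
seq_a = "MKVLAAGIVGLLLAQ" * 20
seq_b = "DEHRKSTNQCWYFPG" * 20
query = "MKVLAAGIVGLLLAQ" * 18

training = [(seq_a, "family_A"), (seq_b, "family_B")]
predicted = nearest_neighbour_label(query, training)
print(predicted)
```

Because compressor overhead makes up a large share of the output for short inputs, distances computed this way vary with sequence length and complexity, which is consistent with the dependence reported above.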

Similar articles

Application of a simple likelihood ratio approximant to protein sequence classification.

Bioinformatics. 2006 Dec 1;22(23):2865-9. doi: 10.1093/bioinformatics/btl512. Epub 2006 Nov 7.

Benchmarking protein classification algorithms via supervised cross-validation.

J Biochem Biophys Methods. 2008 Apr 24;70(6):1215-23. doi: 10.1016/j.jbbm.2007.05.011. Epub 2007 May 31.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Application of latent semantic analysis to protein remote homology detection.

Bioinformatics. 2006 Feb 1;22(3):285-90. doi: 10.1093/bioinformatics/bti801. Epub 2005 Nov 29.

Fast model-based protein homology detection without alignment.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

Mismatch string kernels for discriminative protein classification.

Bioinformatics. 2004 Mar 1;20(4):467-76. doi: 10.1093/bioinformatics/btg431. Epub 2004 Jan 22.

Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison.

Proteins. 2002 Aug 1;48(2):367-76. doi: 10.1002/prot.10117.

FSSA: a novel method for identifying functional signatures from structural alignments.

Bioinformatics. 2005 Jul 1;21(13):2969-77. doi: 10.1093/bioinformatics/bti471. Epub 2005 Apr 28.

SimShift: identifying structural similarities from NMR chemical shifts.

Bioinformatics. 2006 Feb 15;22(4):460-5. doi: 10.1093/bioinformatics/bti805. Epub 2005 Nov 29.

Cited by

Characteristic Attribute Organization System (CAOS): Identifying Classification Rules Based on Phylogenetically Organized Sequences.

Methods Mol Biol. 2024;2744:335-345. doi: 10.1007/978-1-0716-3581-0_21.

BiComp-DTA: Drug-target binding affinity prediction through complementary biological-related and compression-based featurization approach.

PLoS Comput Biol. 2023 Mar 31;19(3):e1011036. doi: 10.1371/journal.pcbi.1011036. eCollection 2023 Mar.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.

Graph Theory-Based Sequence Descriptors as Remote Homology Predictors.

Biomolecules. 2019 Dec 23;10(1):26. doi: 10.3390/biom10010026.

Phylogenetics beyond biology.

Theory Biosci. 2018 Nov;137(2):133-143. doi: 10.1007/s12064-018-0264-7. Epub 2018 Jun 21.

Normalized Compression Distance of Multisets with Applications.

IEEE Trans Pattern Anal Mach Intell. 2015 Aug;37(8):1602-14. doi: 10.1109/TPAMI.2014.2375175.

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

Bioinformatics. 2015 May 1;31(9):1396-404. doi: 10.1093/bioinformatics/btv006. Epub 2015 Jan 7.

Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison.

BMC Bioinformatics. 2013 Apr 23;14:136. doi: 10.1186/1471-2105-14-136.

Network compression as a quality measure for protein interaction networks.

PLoS One. 2012;7(6):e35729. doi: 10.1371/journal.pone.0035729. Epub 2012 Jun 18.

Comparing biological networks via graph compression.

BMC Syst Biol. 2010 Sep 13;4 Suppl 2(Suppl 2):S13. doi: 10.1186/1752-0509-4-S2-S13.
