• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过通用相似性度量对生物序列和结构进行基于压缩的分类:实验评估

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.

作者信息

Ferragina Paolo, Giancarlo Raffaele, Greco Valentina, Manzini Giovanni, Valiente Gabriel

机构信息

Dipartimento di Matematica Applicazioni, Università di Palermo, Italy.

出版信息

BMC Bioinformatics. 2007 Jul 13;8:252. doi: 10.1186/1471-2105-8-252.

DOI:10.1186/1471-2105-8-252
PMID:17629909
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC1939857/
Abstract

BACKGROUND

Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available.

RESULTS

We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next.

CONCLUSION

UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, accross experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/4415ea2d5618/1471-2105-8-252-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/e9c1d25bda5f/1471-2105-8-252-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/dcc227cc1606/1471-2105-8-252-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/94d524141176/1471-2105-8-252-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/fdf48284e041/1471-2105-8-252-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/fbde30a8f1c9/1471-2105-8-252-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/64c0dd1a9e9d/1471-2105-8-252-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/4415ea2d5618/1471-2105-8-252-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/e9c1d25bda5f/1471-2105-8-252-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/dcc227cc1606/1471-2105-8-252-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/94d524141176/1471-2105-8-252-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/fdf48284e041/1471-2105-8-252-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/fbde30a8f1c9/1471-2105-8-252-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/64c0dd1a9e9d/1471-2105-8-252-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd12/1939857/4415ea2d5618/1471-2105-8-252-7.jpg
摘要

背景

序列相似性是生物学分类和系统发育研究中的关键数学概念。目前主要通过比对来处理。然而,比对方法对于后基因组研究似乎并不适用,因为它们不能很好地随数据集大小扩展,而且似乎仅局限于基因组和蛋白质组序列。因此,无比对相似性度量方法正被积极探索。其中,通用相似性度量(USM)备受关注。它基于柯尔莫哥洛夫复杂性的深层理论,通用性是其最显著的新颖特征。由于它只能通过数据压缩来近似,USM是一种方法而非量化两个字符串相似性的公式。USM有三种近似方法,即通用压缩差异(UCD)、归一化压缩差异(NCD)和压缩差异(CD)。在各种数据集上对它们的适用性和稳健性进行了测试,得出了首个大规模定量估计,即USM方法及其近似方法具有价值。尽管围绕USM发展了丰富的理论,但其实验评估存在局限性:仅结合少数数据压缩器对USM进行了测试,且大多是定性层面的,没有对UCD、NCD和CD进行比较,也没有将USM与现有方法(包括基于比对和非比对的方法)进行比较。

结果

我们使用25种压缩器、USM的所有三种已知近似方法以及六个与分子生物学相关的数据集,对USM方法进行了实验测试。这为该方法提供了首个系统的定量实验评估,自然地补充了现有的许多理论和初步实验结果。此外,我们将USM方法与基于比对和非比对的方法进行了比较。我们可以将实验分为两组。第一组通过ROC(接收者操作特征曲线)分析进行,旨在评估该方法区分和分类生物序列及结构的内在能力。第二组实验旨在评估两种常用分类算法,即算术平均非加权配对组方法(UPGMA)和邻接法(NJ),如何利用该方法执行其任务,通过与黄金标准对比并使用知名统计指标(即F值和划分距离)来评估它们的性能。基于这些实验,可以得出一些结论,并从中得出关于在生物数据上使用USM的新颖且有价值的指导方针。主要结论如下所述。

结论

UCD和NCD难以区分,即 across实验和数据集,它们得出的我们所使用的统计指标值几乎相同,而CD几乎总是比两者都差。在相当一部分实验、压缩器和USM近似方法选择中,UPGMA相对于NJ似乎能产生更好的分类结果,即统计指标值更好(相差10%或以上)。基于部分匹配预测(PPM)的通用数据压缩程序PPMd和用于DNA的Gencompress,是我们所使用压缩算法中表现最佳的,尽管通过统计指标衡量它们与其他算法在性能上的差异在很大程度上取决于数据集,可能并不像预期的那么大。PPMd与UCD或NCD以及UPGMA一起用于序列数据时,性能与比对方法非常接近,尽管稍差(F值相差不到2%)。然而,它能很好地随数据集大小扩展,并且可以处理序列以外的数据。总之,我们的定量分析自然地补充了USM背后丰富的理论,并支持这样的结论:该方法因其稳健性、灵活性、可扩展性以及与现有技术的竞争力而值得使用。特别是,该方法适用于所有文本格式的生物数据。软件和数据集可在补充材料网页上根据GNU GPL协议获取。

相似文献

1
Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.通过通用相似性度量对生物序列和结构进行基于压缩的分类:实验评估
BMC Bioinformatics. 2007 Jul 13;8:252. doi: 10.1186/1471-2105-8-252.
2
Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'.蛋白质的k字统计量比较研究:从序列到“序列空间”
BMC Bioinformatics. 2008 Sep 23;9:394. doi: 10.1186/1471-2105-9-394.
3
Protein multiple sequence alignment benchmarking through secondary structure prediction.通过二级结构预测进行蛋白质多序列比对基准测试。
Bioinformatics. 2017 May 1;33(9):1331-1337. doi: 10.1093/bioinformatics/btw840.
4
On the quality of tree-based protein classification.论基于树的蛋白质分类的质量。
Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
5
Measuring the similarity of protein structures by means of the universal similarity metric.通过通用相似性度量来测量蛋白质结构的相似性。
Bioinformatics. 2004 May 1;20(7):1015-21. doi: 10.1093/bioinformatics/bth031. Epub 2004 Jan 29.
6
ProClust: improved clustering of protein sequences with an extended graph-based approach.ProClust:基于扩展的图形方法改进蛋白质序列聚类
Bioinformatics. 2002;18 Suppl 2:S182-91. doi: 10.1093/bioinformatics/18.suppl_2.s182.
7
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.在流行地区,服用抗叶酸抗疟药物的人群中,叶酸补充剂与疟疾易感性和严重程度的关系。
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
8
LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.LZW-Kernel:快速内核,利用 LZW 压缩器中的变长码块对蛋白质序列进行分类。
Bioinformatics. 2018 Oct 1;34(19):3281-3288. doi: 10.1093/bioinformatics/bty349.
9
Detailed protein sequence alignment based on Spectral Similarity Score (SSS).基于光谱相似性评分(SSS)的详细蛋白质序列比对。
BMC Bioinformatics. 2005 Apr 23;6:105. doi: 10.1186/1471-2105-6-105.
10
ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification.结构比较方法的ROC和混淆分析确定了与手动蛋白质分类存在差异的主要原因。
BMC Bioinformatics. 2006 Apr 13;7:206. doi: 10.1186/1471-2105-7-206.

引用本文的文献

1
Exploring Simplicity Bias in 1D Dynamical Systems.探索一维动力系统中的简单性偏差。
Entropy (Basel). 2024 May 16;26(5):426. doi: 10.3390/e26050426.
2
AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.AlcoR:生物数据中低复杂度区域的无比对模拟、映射和可视化。
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13.
3
Information Theory Opens New Dimensions in Experimental Studies of Animal Behaviour and Communication.信息论为动物行为与交流的实验研究开辟了新维度。

本文引用的文献

1
Mining, compressing and classifying with extensible motifs.使用可扩展基序进行挖掘、压缩和分类。
Algorithms Mol Biol. 2006 Mar 23;1(1):4. doi: 10.1186/1748-7188-1-4.
2
Toward automatic reconstruction of a highly resolved tree of life.迈向高分辨率生命树的自动重建。
Science. 2006 Mar 3;311(5765):1283-7. doi: 10.1126/science.1123061.
3
The use of receiver operating characteristic curves in biomedical informatics.生物医学信息学中接受者操作特征曲线的应用。
Animals (Basel). 2023 Mar 26;13(7):1174. doi: 10.3390/ani13071174.
4
Predicting phenotype transition probabilities via conditional algorithmic probability approximations.通过条件算法概率逼近预测表型转变概率。
J R Soc Interface. 2022 Dec;19(197):20220694. doi: 10.1098/rsif.2022.0694. Epub 2022 Dec 14.
5
Comparative Study on Feature Selection in Protein Structure and Function Prediction.蛋白质结构与功能预测中的特征选择比较研究。
Comput Math Methods Med. 2022 Oct 11;2022:1650693. doi: 10.1155/2022/1650693. eCollection 2022.
6
A Compression-Based Method for Detecting Anomalies in Textual Data.一种基于压缩的文本数据异常检测方法。
Entropy (Basel). 2021 May 16;23(5):618. doi: 10.3390/e23050618.
7
Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences.使用递归特征选择和随机森林提高低相似度序列的蛋白质结构分类预测。
Comput Math Methods Med. 2021 May 7;2021:5529389. doi: 10.1155/2021/5529389. eCollection 2021.
8
AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.AC2:一种使用人工神经网络和缓存哈希模型的高效蛋白质序列压缩工具。
Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.
9
Algorithmic Entropy and Landauer's Principle Link Microscopic System Behaviour to the Thermodynamic Entropy.算法熵与兰道尔原理将微观系统行为与热力学熵联系起来。
Entropy (Basel). 2018 Oct 17;20(10):798. doi: 10.3390/e20100798.
10
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.基于压缩的度量方法在灵长类基因组进化中的应用比较
Entropy (Basel). 2018 May 23;20(6):393. doi: 10.3390/e20060393.
J Biomed Inform. 2005 Oct;38(5):404-15. doi: 10.1016/j.jbi.2005.02.008. Epub 2005 Apr 2.
4
ROCR: visualizing classifier performance in R.ROCR:在R语言中可视化分类器性能
Bioinformatics. 2005 Oct 15;21(20):3940-1. doi: 10.1093/bioinformatics/bti623. Epub 2005 Aug 11.
5
Nh3D: a reference dataset of non-homologous protein structures.Nh3D:非同源蛋白质结构的参考数据集。
BMC Struct Biol. 2005 Jul 12;5:12. doi: 10.1186/1472-6807-5-12.
6
Computational cluster validation in post-genomic data analysis.后基因组数据分析中的计算聚类验证
Bioinformatics. 2005 Aug 1;21(15):3201-12. doi: 10.1093/bioinformatics/bti517. Epub 2005 May 24.
7
The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.CATH结构域数据库以及相关资源Gene3D和DHS为基因组分析提供了全面的结构域家族信息。
Nucleic Acids Res. 2005 Jan 1;33(Database issue):D247-51. doi: 10.1093/nar/gki024.
8
Sensitivity and selectivity in protein structure comparison.蛋白质结构比较中的敏感性和选择性。
Protein Sci. 2004 Mar;13(3):773-85. doi: 10.1110/ps.03328504.
9
Measuring the similarity of protein structures by means of the universal similarity metric.通过通用相似性度量来测量蛋白质结构的相似性。
Bioinformatics. 2004 May 1;20(7):1015-21. doi: 10.1093/bioinformatics/bth031. Epub 2004 Jan 29.
10
A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary.折叠空间的共识观点:结合SCOP、CATH和达利结构域词典
Protein Sci. 2003 Oct;12(10):2150-60. doi: 10.1110/ps.0306803.