Levenshtein Distance, Sequence Comparison and Biological Database Search.

Authors

Berger Bonnie, Waterman Michael S, Yu Yun William

Affiliations

Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with the Department of Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

Quantitative and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089 USA.

Publication information

IEEE Trans Inf Theory. 2021 Jun;67(6):3287-3294. doi: 10.1109/tit.2020.2996543. Epub 2020 May 21.

DOI:10.1109/tit.2020.2996543

PMID:34257466

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8274556/

Abstract

Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and in biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to the heuristics employed in BLAST, the most widely used software in bioinformatics, a program that searches DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders-of-magnitude acceleration of biological similarity search.
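For readers unfamiliar with the dynamic programming approach mentioned above, the following is a minimal sketch of the standard textbook recurrence for unit-cost Levenshtein distance. It is not taken from the article; the function name `levenshtein` and the two-row layout are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (illustrative sketch, not the article's code)."""
    # prev[j] is the distance between the already-processed prefix of `a`
    # and the prefix b[:j]; initially the prefix of `a` is empty.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string costs i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Keeping only two rows uses memory proportional to the shorter dimension of the table, at the cost of not recovering the alignment itself. Because this distance is a metric, it satisfies the triangle inequality, so d(q, x) >= d(q, c) - d(c, x) for any reference sequence c; bounds of this kind are what allow metric-based search to skip candidates without computing every distance.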
